BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset

Authors: Istiaq Ahmed Fahad, Kamruzzaman Asif, Sifat Sikder

Published: 2025-05-16 05:42:25+00:00

AI Summary

This paper introduces BanglaFake, a new Bengali deepfake audio dataset containing 12,260 real and 13,260 deepfake utterances generated using a state-of-the-art TTS model. The dataset's quality is evaluated through qualitative and quantitative analyses, showing high naturalness and intelligibility of the deepfakes, making it a valuable resource for deepfake detection research in low-resource languages.

Abstract

Deepfake audio detection is challenging for low-resource languages like Bengali due to limited datasets and subtle acoustic features. To address this, we introduce BanglaFake, a Bengali Deepfake Audio Dataset with 12,260 real and 13,260 deepfake utterances. Synthetic speech is generated using SOTA Text-to-Speech (TTS) models, ensuring high naturalness and quality. We evaluate the dataset through both qualitative and quantitative analyses. Mean Opinion Score (MOS) ratings from 30 native speakers yield a Robust-MOS of 3.40 (naturalness) and 4.01 (intelligibility). t-SNE visualization of MFCCs highlights how difficult it is to separate real from fake audio. This dataset serves as a crucial resource for advancing deepfake detection in Bengali, addressing the limitations of low-resource language research.

Key findings

The generated deepfake audio in BanglaFake achieved a Robust-MOS of 3.40 for naturalness and 4.01 for intelligibility. t-SNE visualization revealed significant overlap between real and deepfake audio in the MFCC feature space, highlighting the challenge of detection. The dataset is publicly available on Hugging Face and GitHub.
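
This page does not say how Robust-MOS is computed. As a rough illustration only, the sketch below aggregates listener ratings with a trimmed mean, one common way to damp outlier ratings. The trimming fraction, the function name, and the sample ratings are all assumptions made for this example; they are not the authors' procedure or data.

```python
from statistics import mean


def trimmed_mos(ratings, trim_fraction=0.1):
    """Average 1-5 listener ratings after dropping the extremes.

    ASSUMPTION: a symmetric trimmed mean stands in for "Robust-MOS";
    the paper's exact definition is not given on this page.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")

    ordered = sorted(ratings)
    # Number of ratings to discard at each end of the sorted list.
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return mean(kept)


# Hypothetical ratings for one utterance from ten listeners.
example_ratings = [1, 3, 3, 4, 4, 4, 4, 5, 5, 5]
plain_score = trimmed_mos(example_ratings, trim_fraction=0.0)
robust_score = trimmed_mos(example_ratings, trim_fraction=0.1)
```

With these made-up ratings, the single very low score pulls the plain mean down, while the trimmed version discards it along with one top score before averaging.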

Approach

The authors created the BanglaFake dataset by generating deepfake audio using a VITS-based Text-to-Speech (TTS) model trained on the SUST TTS Corpus and Mozilla Common Voice datasets. The quality of the generated audio was evaluated using Mean Opinion Score (MOS) and t-SNE visualization of MFCCs.
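
The MFCC and t-SNE analysis described above could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the number of coefficients, the mean-over-time pooling, the perplexity, and the helper names `mfcc_summary` and `embed_2d` are assumptions chosen for this example.

```python
import numpy as np
from sklearn.manifold import TSNE


def mfcc_summary(path, n_mfcc=13):
    """Return one fixed-length MFCC vector for an audio file.

    ASSUMPTION: frames are pooled by averaging over time; the paper
    may summarize MFCCs differently.
    """
    # Imported lazily so the t-SNE step can run without librosa installed.
    import librosa

    signal, sample_rate = librosa.load(path, sr=None)  # keep native rate
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # shape: (n_mfcc,)


def embed_2d(features, perplexity=30.0, seed=0):
    """Project a (n_samples, n_features) matrix to 2-D with t-SNE."""
    features = np.asarray(features, dtype=float)
    if perplexity >= features.shape[0]:
        raise ValueError("perplexity must be smaller than n_samples")
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=seed)
    return tsne.fit_transform(features)


# Stand-in data: random vectors in place of real and fake MFCC summaries.
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(40, 13))
demo_labels = np.array([0] * 20 + [1] * 20)  # 0 = real, 1 = deepfake
demo_embedding = embed_2d(demo_features, perplexity=10.0)
```

In practice, `demo_features` would be replaced by stacking `mfcc_summary(path)` over the real and synthetic clips, and the two label groups would be plotted in different colors to check how much they overlap.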

Datasets

SUST TTS Corpus, Mozilla Common Voice

Model(s)

VITS-based Text-to-Speech (TTS) model

Author countries

Bangladesh
