Zero-Shot to Zero-Lies: Detecting Bengali Deepfake Audio through Transfer Learning

Authors: Most. Sharmin Sultana Samu, Md. Rakibul Islam, Md. Zahid Hossain, Md. Kamrozzaman Bhuiyan, Farhad Uz Zaman

Published: 2025-12-25 14:53:40+00:00

Comment: Accepted for publication in 2025 28th International Conference on Computer and Information Technology (ICCIT)

AI Summary

This paper addresses the challenge of detecting Bengali deepfake audio, an area that remains largely unexplored. The authors evaluate zero-shot inference with several pretrained models and fine-tune multiple deep learning architectures on the BanglaFake dataset. They demonstrate that fine-tuning significantly improves detection performance over zero-shot inference, providing the first systematic benchmark for Bengali deepfake audio detection.

Abstract

The rapid growth of speech synthesis and voice conversion systems has made deepfake audio a major security concern, yet Bengali deepfake detection remains largely unexplored. In this work, we study automatic detection of Bengali audio deepfakes using the BanglaFake dataset. We first evaluate zero-shot inference with several pretrained models: Wav2Vec2-XLSR-53, Whisper, PANNs-CNN14, WavLM and Audio Spectrogram Transformer. Zero-shot results show limited detection ability; the best model, Wav2Vec2-XLSR-53, achieves 53.80% accuracy, 56.60% AUC and 46.20% EER. We then fine-tune multiple architectures for Bengali deepfake detection: Wav2Vec2-Base, LCNN, LCNN-Attention, ResNet18, ViT-B16 and CNN-BiLSTM. Fine-tuned models show strong performance gains; ResNet18 achieves the highest accuracy of 79.17%, F1 score of 79.12%, AUC of 84.37% and EER of 24.35%. Experimental results confirm that fine-tuning significantly improves performance over zero-shot inference. This study provides the first systematic benchmark of Bengali deepfake audio detection and highlights the effectiveness of fine-tuned deep learning models for this low-resource language.
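The abstract does not spell out how zero-shot scores are obtained from models that were never trained for deepfake detection. A minimal sketch of one plausible setup, assuming embeddings are mean-pooled from Wav2Vec2-XLSR-53 via Hugging Face transformers and clips are scored by similarity to a centroid of known-real embeddings (the file names and the centroid scoring rule are illustrative assumptions, not the paper's method):

```python
# Zero-shot sketch: extract utterance embeddings with Wav2Vec2-XLSR-53 and
# score a clip by cosine similarity to a centroid of known-real embeddings.
# The scoring rule is an assumption for illustration only.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-xlsr-53")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53").eval()

def embed(path: str) -> torch.Tensor:
    """Return a mean-pooled utterance embedding (expects mono 16 kHz input)."""
    wav, sr = torchaudio.load(path)
    if sr != 16_000:
        wav = torchaudio.functional.resample(wav, sr, 16_000)
    inputs = extractor(wav.mean(dim=0).numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, frames, 1024)
    return hidden.mean(dim=1).squeeze(0)            # (1024,)

# Hypothetical file names: a centroid built from a few known-real clips.
real_centroid = torch.stack([embed(p) for p in ["real1.wav", "real2.wav"]]).mean(dim=0)
score = torch.nn.functional.cosine_similarity(embed("test.wav"), real_centroid, dim=0)
print(f"similarity to real-speech centroid: {score.item():.3f}")
```

With no task-specific training signal, a rule like this can only exploit whatever real/fake separation already exists in the pretrained embedding space, which is consistent with the near-chance zero-shot numbers reported above.
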


Key findings
Zero-shot inference with pretrained models showed limited effectiveness, with the best model (Wav2Vec2-XLSR-53) achieving only 53.80% accuracy. Fine-tuned models achieved significant performance gains, with ResNet18 performing best with 79.17% accuracy, 79.12% F1 score, 84.37% AUC, and 24.35% EER. The study confirms that fine-tuning deep learning models is essential for robust Bengali deepfake audio detection in low-resource settings.
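
The reported AUC and EER are standard detection metrics; as a reference, here is a short sketch of how both can be computed from model scores with scikit-learn, where EER is taken at the point where the false-positive and false-negative rates cross (the labels and scores below are toy values, not the paper's data):

```python
# AUC from the ROC curve, and EER as the crossing point of FPR and FNR.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auc_and_eer(labels: np.ndarray, scores: np.ndarray) -> tuple[float, float]:
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    eer = fpr[np.nanargmin(np.abs(fpr - fnr))]  # where FPR and FNR intersect
    return auc, eer

labels = np.array([0, 0, 1, 1, 1, 0])   # 1 = deepfake, 0 = real (toy data)
scores = np.array([0.1, 0.4, 0.8, 0.6, 0.9, 0.3])
auc, eer = auc_and_eer(labels, scores)
print(f"AUC={auc:.2%}  EER={eer:.2%}")
```
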
Approach
The study first evaluates the zero-shot inference capabilities of large pretrained models such as Wav2Vec2-XLSR-53, Whisper, and WavLM on the Bengali deepfake detection task. It then fine-tunes a diverse set of deep learning architectures, including Wav2Vec2-Base, LCNN, LCNN-Attention, ResNet18, ViT-B16, and CNN-BiLSTM, on the BanglaFake dataset to improve performance.
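
Since ResNet18 is the best fine-tuned model, a minimal sketch of that setup may help, assuming log-mel spectrogram inputs with an adapted single-channel stem and a two-class head (the paper's exact front end, input resolution, and hyperparameters are not given here, so these choices are assumptions):

```python
# Fine-tuning sketch: torchvision ResNet18 on log-mel spectrograms for
# real-vs-fake classification. Front-end settings are illustrative.
import torch
import torch.nn as nn
import torchaudio
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1")
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 1-channel spectrograms
model.fc = nn.Linear(model.fc.in_features, 2)                                   # real vs fake

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16_000, n_mels=64)
to_db = torchaudio.transforms.AmplitudeToDB()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(waveforms: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of (batch, samples) waveforms."""
    specs = to_db(mel(waveforms)).unsqueeze(1)  # (batch, 1, n_mels, frames)
    loss = criterion(model(specs), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch: 4 one-second clips at 16 kHz with random labels.
print(train_step(torch.randn(4, 16_000), torch.randint(0, 2, (4,))))
```

Replacing the first convolution discards its ImageNet weights but lets the rest of the pretrained backbone transfer to single-channel spectrogram input, which is a common adaptation when repurposing image models for audio.
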
Datasets
BanglaFake dataset (which sources real audio from SUST TTS Corpus and Mozilla Common Voice)
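
For completeness, a hypothetical PyTorch loader for a BanglaFake-style corpus, assuming a layout of real/ and fake/ folders of WAV files (the dataset's actual directory structure may differ):

```python
# Hypothetical loader: pads or crops every clip to a fixed length so that
# batches stack cleanly. Folder names "real" and "fake" are assumptions.
from pathlib import Path
import torch
import torchaudio
from torch.utils.data import Dataset

class BanglaFakeDataset(Dataset):
    def __init__(self, root: str, seconds: float = 4.0, sr: int = 16_000):
        self.items = [(p, 0) for p in Path(root, "real").glob("*.wav")] + \
                     [(p, 1) for p in Path(root, "fake").glob("*.wav")]
        self.num_samples = int(seconds * sr)
        self.sr = sr

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int):
        path, label = self.items[idx]
        wav, sr = torchaudio.load(str(path))
        wav = torchaudio.functional.resample(wav.mean(dim=0), sr, self.sr)  # mono, 16 kHz
        wav = torch.nn.functional.pad(wav, (0, max(0, self.num_samples - wav.numel())))
        return wav[: self.num_samples], label
```
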
Model(s)
Wav2Vec2-XLSR-53, Whisper-small, Whisper-medium, PANNs-CNN14, WavLM-Base-Plus, Audio Spectrogram Transformer (AST) (for zero-shot); Wav2Vec2-Base, LCNN, LCNN-Attention, ResNet18, ViT-B16, CNN-BiLSTM (for fine-tuning)
Author countries
Bangladesh