DIN-CTS: Low-Complexity Depthwise-Inception Neural Network with Contrastive Training Strategy for Deepfake Speech Detection

Authors: Lam Pham, Dat Tran, Phat Lam, Florian Skopik, Alexander Schindler, Silvia Poletti, David Fischinger, Martin Boyer

Published: 2025-02-27 16:09:04+00:00

AI Summary

This paper proposes DIN-CTS, a low-complexity deepfake speech detection system using a Depthwise-Inception Network (DIN) trained with a contrastive training strategy (CTS). The system transforms audio into spectrograms, trains the DIN to extract audio embeddings, fits a Gaussian distribution to the embeddings of bonafide speech, and detects deepfakes by measuring how far a test utterance's embedding lies from that distribution.

Abstract

In this paper, we propose a deep neural network approach for deepfake speech detection (DSD) based on a low-complexity Depthwise-Inception Network (DIN) trained with a contrastive training strategy (CTS). In this framework, input audio recordings are first transformed into spectrograms using the Short-Time Fourier Transform (STFT) and a Linear Filter (LF), which are then used to train the DIN. Once trained, the DIN processes bonafide utterances to extract audio embeddings, which are used to construct a Gaussian distribution representing genuine speech. Deepfake detection is then performed by computing the distance between a test utterance's embedding and this distribution to determine whether the utterance is fake or bonafide. To evaluate our proposed system, we conducted extensive experiments on the ASVspoof 2019 LA benchmark dataset. The experimental results demonstrate the effectiveness of combining the Depthwise-Inception Network with the contrastive learning strategy in distinguishing between fake and bonafide utterances. We achieved Equal Error Rate (EER), Accuracy (Acc.), F1, and AUC scores of 4.6%, 95.4%, 97.3%, and 98.9%, respectively, using a single, low-complexity DIN with just 1.77 M parameters and 985 M FLOPS on short audio segments (4 seconds). Furthermore, our proposed system outperforms the single-system submissions in the ASVspoof 2019 LA challenge, showcasing its potential for real-time applications.
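The abstract describes a spectrogram front-end built from the STFT followed by a Linear Filter. Below is a minimal sketch of such a front-end, assuming linearly spaced triangular filters; the concrete settings (sample rate, n_fft, hop length, number of bands, log scaling) are illustrative assumptions and are not specified in this summary.

```python
# Sketch of an STFT + linear-filter spectrogram front-end (settings are assumptions).
import numpy as np
import librosa

def linear_filterbank(sr: int, n_fft: int, n_bands: int) -> np.ndarray:
    """Triangular filters spaced linearly in frequency (one common 'Linear Filter' choice)."""
    fft_freqs = np.linspace(0.0, sr / 2, 1 + n_fft // 2)
    edges = np.linspace(0.0, sr / 2, n_bands + 2)
    fb = np.zeros((n_bands, len(fft_freqs)))
    for i in range(n_bands):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def lf_spectrogram(path: str, sr: int = 16000, n_fft: int = 1024,
                   hop: int = 256, n_bands: int = 128, seconds: float = 4.0) -> np.ndarray:
    """Load a 4-second segment, compute |STFT|, apply the linear filter bank, take the log."""
    y, _ = librosa.load(path, sr=sr, duration=seconds)
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))   # (1 + n_fft//2, frames)
    lf = linear_filterbank(sr, n_fft, n_bands) @ mag             # (n_bands, frames)
    return np.log(lf + 1e-6)
```

The 4-second segment length matches the audio duration reported in the abstract; all other values are placeholders.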


Key findings
DIN-CTS achieved an Equal Error Rate (EER) of 4.6%, outperforming the single-system submissions in the ASVspoof 2019 LA challenge. The model is low-complexity (1.77 M parameters, 985 M FLOPS), making it suitable for real-time applications. The contrastive training strategy effectively separated the distributions of bonafide and deepfake speech.
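For reference, the EER reported above is the operating point at which the miss rate (deepfakes accepted as bonafide) equals the false-alarm rate (bonafide flagged as deepfake). The sketch below shows one straightforward way to compute it from detection scores; it mirrors standard practice and is not the authors' evaluation code.

```python
# Illustrative EER computation, assuming higher scores mean "more likely deepfake".
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """labels: 1 = deepfake, 0 = bonafide; returns the EER as a fraction in [0, 1]."""
    fake, bona = scores[labels == 1], scores[labels == 0]
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.unique(scores)):
        miss = np.mean(fake < t)          # deepfakes scored below threshold (accepted as bonafide)
        false_alarm = np.mean(bona >= t)  # bonafide scored at/above threshold (flagged as deepfake)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return float(eer)
```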
Approach
The system uses the Short-Time Fourier Transform (STFT) and a Linear Filter (LF) to generate spectrograms from audio. A Depthwise-Inception Network (DIN) is trained with a contrastive training strategy combining multiple losses, so that its embeddings separate bonafide from deepfake speech. Deepfake detection is then performed by calculating the Mahalanobis distance between a test utterance's embedding and a pre-computed Gaussian distribution of bonafide speech embeddings, as sketched below.
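The following is a minimal sketch of the distance-based detection step: fit a single Gaussian to bonafide embeddings produced by the trained DIN, then score a test utterance by its Mahalanobis distance to that Gaussian. The embedding extractor (`extract_embedding`) and the decision threshold (`tau`) are hypothetical placeholders; neither is specified in this summary.

```python
# Gaussian modelling of bonafide embeddings and Mahalanobis-distance scoring (sketch).
import numpy as np

def fit_bonafide_gaussian(bonafide_embeddings: np.ndarray):
    """Estimate mean and inverse (regularised) covariance from bonafide embeddings of shape (N, D)."""
    mu = bonafide_embeddings.mean(axis=0)
    cov = np.cov(bonafide_embeddings, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])   # small ridge for numerical stability
    return mu, np.linalg.inv(cov)

def mahalanobis_score(embedding: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    """Larger distance = further from the bonafide distribution = more likely deepfake."""
    diff = embedding - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

# Usage sketch: `extract_embedding` stands in for the trained DIN, `tau` for a
# threshold tuned on a development set (both hypothetical).
# mu, cov_inv = fit_bonafide_gaussian(bonafide_train_embeddings)
# score = mahalanobis_score(extract_embedding(test_spectrogram), mu, cov_inv)
# is_deepfake = score > tau
```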
Datasets
ASVspoof 2019 LA dataset
Model(s)
Depthwise-Inception Network (DIN)
Author countries
Austria, Vietnam