Fusion of Modulation Spectrogram and SSL with Multi-head Attention for Fake Speech Detection

Authors: Rishith Sadashiv T N, Abhishek Bedge, Saisha Suresh Bore, Jagabandhu Mishra, Mrinmoy Bhattacharjee, S R Mahadeva Prasanna

Published: 2025-08-01 19:20:18+00:00

AI Summary

This paper proposes a novel fake speech detection model that fuses self-supervised learning (SSL) speech embeddings with modulation spectrogram features using multi-head attention. The resulting fused representation is then fed into an AASIST network for classification, achieving significant performance improvements over a baseline model in both in-domain and out-of-domain scenarios.

Abstract

Fake speech detection systems have become a necessity to combat speech deepfakes. Current systems exhibit poor generalizability on out-of-domain speech samples due to a lack of diverse training data. In this paper, we attempt to address the domain generalization issue by proposing a novel speech representation using self-supervised learning (SSL) speech embeddings and the Modulation Spectrogram (MS) feature. A fusion strategy is used to combine both speech representations to introduce a new front-end for the classification task. The proposed SSL+MS fusion representation is passed to the AASIST back-end network. Experiments are conducted on monolingual and multilingual fake speech datasets to evaluate the efficacy of the proposed model architecture in cross-dataset and multilingual scenarios. The proposed model achieves relative performance improvements of 37% and 20% on the ASVspoof 2019 and MLAAD datasets, respectively, in in-domain settings compared to the baseline. In the out-of-domain scenario, the model trained on ASVspoof 2019 shows a 36% relative improvement when evaluated on the MLAAD dataset. Across all evaluated languages, the proposed model consistently outperforms the baseline, indicating enhanced domain generalization.
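The modulation spectrogram captures slow temporal envelope dynamics (prosodic-level cues) that complement the frame-level SSL features. Below is a minimal sketch of one common MS extraction recipe: per-band magnitude envelopes from an STFT, followed by a second FFT across time to obtain modulation frequencies. The filterbank, window sizes, and hop lengths here are illustrative assumptions; the paper's exact settings are not specified in this summary.

```python
# Minimal modulation spectrogram (MS) sketch: STFT envelopes + second FFT.
# All parameters (n_fft, hop, mod_win) are illustrative, not the paper's.
import numpy as np
from scipy.signal import stft

def modulation_spectrogram(x, sr=16000, n_fft=512, hop=160, mod_win=64):
    # Stage 1: acoustic spectrogram -> per-band magnitude envelopes.
    _, _, S = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    env = np.abs(S)                              # (n_bands, n_frames)

    # Stage 2: FFT of each band's envelope over sliding windows in time,
    # yielding energy as a function of (acoustic band, modulation frequency).
    n_starts = env.shape[1] - mod_win + 1
    ms = np.stack([
        np.abs(np.fft.rfft(env[:, t:t + mod_win], axis=1))
        for t in range(0, max(n_starts, 1), mod_win // 2)
    ])                                           # (n_segments, n_bands, n_mod_freqs)
    return ms

ms = modulation_spectrogram(np.random.randn(16000))  # 1 s of dummy 16 kHz audio
print(ms.shape)
```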


Key findings
The proposed model significantly outperforms the baseline in both in-domain and cross-dataset evaluations, showing a relative improvement of up to 37%. The model also demonstrates improved generalization across multiple languages, indicating enhanced robustness to domain shifts and language variations.
Approach
The authors address the domain generalization problem in fake speech detection by fusing self-supervised learning (SSL) embeddings with modulation spectrogram features. A multi-head attention mechanism combines these representations, which are then input to an AASIST network for classification. This approach aims to leverage both frame-level and prosodic-level information for improved robustness.
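A minimal sketch of the fusion idea is given below, assuming the SSL stream attends to the MS stream via standard cross multi-head attention; the projection dimensions, head count, and the choice of which stream serves as query versus key/value are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of SSL+MS fusion with multi-head attention (PyTorch).
import torch
import torch.nn as nn

class SSLMSFusion(nn.Module):
    def __init__(self, ssl_dim=1024, ms_dim=257, d_model=256, n_heads=4):
        super().__init__()
        self.ssl_proj = nn.Linear(ssl_dim, d_model)   # project SSL frames
        self.ms_proj = nn.Linear(ms_dim, d_model)     # project MS frames
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, ssl_feats, ms_feats):
        # ssl_feats: (batch, T_ssl, ssl_dim); ms_feats: (batch, T_ms, ms_dim)
        q = self.ssl_proj(ssl_feats)
        kv = self.ms_proj(ms_feats)
        fused, _ = self.attn(q, kv, kv)               # cross-attention fusion
        return fused                                  # passed on to the AASIST back-end

fusion = SSLMSFusion()
out = fusion(torch.randn(2, 100, 1024), torch.randn(2, 50, 257))
print(out.shape)  # torch.Size([2, 100, 256])
```

Cross-attention lets each frame-level SSL embedding query the prosodic-level MS representation, so the fused output carries information from both time scales before classification.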
Datasets
ASVspoof 2019 Logical Access (LA), ASVspoof 2021 LA, and MLAAD datasets.
Model(s)
wav2vec 2.0 XLS-R (for SSL embeddings), AASIST network.
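For reference, frame-level SSL embeddings can be extracted with a public wav2vec 2.0 XLS-R checkpoint via HuggingFace Transformers, as in the sketch below; the specific checkpoint ("facebook/wav2vec2-xls-r-300m") and the use of the last hidden state are illustrative assumptions, not necessarily the paper's exact setup.

```python
# Sketch: frame-level SSL embeddings from a wav2vec 2.0 XLS-R checkpoint.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

name = "facebook/wav2vec2-xls-r-300m"  # assumed checkpoint, hidden size 1024
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name)
model.eval()

waveform = torch.randn(16000)  # 1 s of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, n_frames, 1024)
print(hidden.shape)
```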
Author countries
India, Finland