BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention

Authors: Yassine El Kheir, Tim Polzehl, Sebastian Möller

Published: 2025-05-20 04:52:59+00:00

AI Summary

BiCrossMamba-ST is a speech deepfake detection framework built on a dual-branch spectro-temporal architecture with bidirectional Mamba blocks and mutual cross-attention. By capturing the subtle cues of synthetic speech, it achieves significant improvements over state-of-the-art methods on the ASVspoof LA21 and DF21 benchmarks.

Abstract

We propose BiCrossMamba-ST, a robust framework for speech deepfake detection that leverages a dual-branch spectro-temporal architecture powered by bidirectional Mamba blocks and mutual cross-attention. By processing spectral sub-bands and temporal intervals separately and then integrating their representations, BiCrossMamba-ST effectively captures the subtle cues of synthetic speech. In addition, our proposed framework leverages a convolution-based 2D attention map to focus on specific spectro-temporal regions, enabling robust deepfake detection. Operating directly on raw features, BiCrossMamba-ST achieves significant performance improvements: relative gains of 67.74% and 26.3% over state-of-the-art AASIST on the ASVspoof LA21 and ASVspoof DF21 benchmarks, respectively, and a 6.80% improvement over RawBMamba on ASVspoof DF21. Code and models will be made publicly available.
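The convolution-based 2D attention map is the component most easily illustrated in isolation. Below is a minimal PyTorch sketch of such a map over a (frequency, time) feature grid; the module name, kernel sizes, activation, and sigmoid gating are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class Conv2DAttentionMap(nn.Module):
    """Convolution-based 2D attention over a (freq, time) feature grid.

    Minimal sketch: the hidden width, kernel sizes, and sigmoid gating
    are assumptions; the paper's exact parameterization may differ.
    """

    def __init__(self, channels: int, hidden: int = 16):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        attn = torch.sigmoid(self.score(x))  # (batch, 1, freq, time), values in [0, 1]
        return x * attn  # reweight spectro-temporal regions


feats = torch.randn(2, 32, 64, 100)           # toy (batch, ch, freq, time) features
out = Conv2DAttentionMap(channels=32)(feats)  # same shape, attention-weighted
```

Gating the feature map multiplicatively lets the network suppress regions that carry little evidence of synthesis artifacts while leaving the branch structure downstream unchanged.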


Key findings
BiCrossMamba-ST outperforms state-of-the-art methods, achieving relative gains of 67.74% and 26.3% over AASIST on ASVspoof LA21 and DF21, respectively, and a 6.80% improvement over RawBMamba on ASVspoof DF21. Ablation studies confirm the importance of the 2D attention map and mutual cross-attention, and the BiMamba architecture shows significant advantages over GAT and Transformer alternatives.
Approach
The approach uses a dual-branch architecture that processes spectral sub-bands and temporal intervals separately via bidirectional Mamba (BiMamba) blocks. A convolution-based 2D attention map highlights crucial spectro-temporal regions, and mutual cross-attention integrates the two branches' representations before classification; a minimal sketch follows.
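The sketch below shows one plausible reading of the dual-branch flow, assuming the mamba_ssm package for the Mamba block (pip install mamba-ssm; requires CUDA at forward time). The summation of the two directions, the attention hyperparameters, and the mean-pool classifier head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed dependency: pip install mamba-ssm


class BiMamba(nn.Module):
    """Bidirectional Mamba: one pass forward, one over the reversed sequence.

    Summing the two directions is an assumption; the paper may fuse them
    differently (e.g. concatenation or gating).
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.fwd = Mamba(d_model=d_model)
        self.bwd = Mamba(d_model=d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        return self.fwd(x) + self.bwd(x.flip(1)).flip(1)


class BiCrossMambaSketch(nn.Module):
    """Dual-branch spectro-temporal model with mutual cross-attention."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.spec_branch = BiMamba(d_model)  # over spectral sub-bands
        self.temp_branch = BiMamba(d_model)  # over temporal intervals
        self.spec_q = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temp_q = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(2 * d_model, 2)  # bonafide vs. spoof logits

    def forward(self, spec: torch.Tensor, temp: torch.Tensor) -> torch.Tensor:
        # spec: (batch, n_subbands, d_model); temp: (batch, n_intervals, d_model)
        spec, temp = self.spec_branch(spec), self.temp_branch(temp)
        # Mutual cross-attention: each branch queries the other.
        spec_x, _ = self.spec_q(spec, temp, temp)
        temp_x, _ = self.temp_q(temp, spec, spec)
        pooled = torch.cat([spec_x.mean(dim=1), temp_x.mean(dim=1)], dim=-1)
        return self.head(pooled)
```

Routing each branch's queries through the other branch's keys and values is one straightforward interpretation of "mutual cross-attention"; how this integration is ordered relative to the 2D attention map follows the paper itself.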
Datasets
ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF, ASVspoof 5
Model(s)
BiCrossMamba-ST (which incorporates BiMamba blocks and mutual cross-attention), Mamba, inBiMamba, FlipMamba, RawNet2, AASIST, SE-Rawformer, RawBMamba, HM-Conformer
Author countries
Germany