XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection

Authors: Kwok-Ho Ng, Tingting Song, Yongdong Wu, Zhihua Xia

Published: 2026-01-06 11:41:05+00:00

Comment: 11 pages, 3 figures

AI Summary

This paper proposes XLSR-MamBo, a modular framework for audio deepfake detection that integrates an XLSR front-end with hybrid Mamba-Attention backbones. It leverages the complementary strengths of State Space Models (efficient temporal compression) and Attention (global artifact retrieval). The framework achieves competitive performance and robust generalization across the ASVspoof 2021 LA, ASVspoof 2021 DF, In-the-Wild, and DFADD benchmarks, with deeper backbones improving stability.

Abstract

Advanced speech synthesis technologies have enabled highly realistic speech generation, posing security risks that motivate research into audio deepfake detection (ADD). While state space models (SSMs) offer linear complexity, pure causal SSM architectures often struggle with the content-based retrieval required to capture global frequency-domain artifacts. To address this, we explore the scaling properties of hybrid architectures by proposing XLSR-MamBo, a modular framework integrating an XLSR front-end with synergistic Mamba-Attention backbones. We systematically evaluate four topological designs using advanced SSM variants: Mamba, Mamba2, Hydra, and Gated DeltaNet. Experimental results demonstrate that the MamBo-3-Hydra-N3 configuration achieves competitive performance compared to other state-of-the-art systems on the ASVspoof 2021 LA, DF, and In-the-Wild benchmarks. This performance benefits from Hydra's native bidirectional modeling, which captures holistic temporal dependencies more efficiently than the heuristic dual-branch strategies employed in prior work. Furthermore, evaluations on the DFADD dataset demonstrate robust generalization to unseen diffusion- and flow-matching-based synthesis methods. Crucially, our analysis reveals that scaling backbone depth effectively mitigates the performance variance and instability observed in shallower models. These results confirm the hybrid framework's ability to capture artifacts in spoofed speech signals, providing an effective method for ADD.
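
To make the bidirectionality contrast concrete, here is a minimal PyTorch sketch of the heuristic dual-branch strategy the abstract refers to: a causal SSM is run once on the sequence and once on its time reversal, and the two passes are fused, whereas Hydra handles both directions in a single natively bidirectional layer. The class name and the linear fusion below are illustrative assumptions, not the implementation used in prior systems.

```python
import torch
import torch.nn as nn


class DualBranchBiSSM(nn.Module):
    """Heuristic bidirectionality: a forward pass plus a time-reversed pass.

    `ssm_fwd` / `ssm_bwd` stand for any causal SSM layers (e.g., Mamba);
    nn.Identity works for a shape-only smoke test.
    """
    def __init__(self, ssm_fwd: nn.Module, ssm_bwd: nn.Module, dim: int):
        super().__init__()
        self.fwd, self.bwd = ssm_fwd, ssm_bwd
        self.merge = nn.Linear(2 * dim, dim)   # fuse the two directions

    def forward(self, x):                      # x: (batch, time, dim)
        f = self.fwd(x)                        # left-to-right context
        b = self.bwd(x.flip(1)).flip(1)        # right-to-left context
        return self.merge(torch.cat([f, b], dim=-1))


# Shape check with identity stand-ins for the SSM branches:
out = DualBranchBiSSM(nn.Identity(), nn.Identity(), dim=8)(torch.randn(2, 5, 8))
print(out.shape)  # torch.Size([2, 5, 8])
```

The cost of this heuristic is two full SSM passes plus a fusion layer per block, which is the overhead Hydra's native bidirectional modeling is claimed to avoid.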


Key findings

The MamBo-3-Hydra-N3 configuration achieved competitive performance, surpassing several state-of-the-art single systems on the ASVspoof 2021 LA, DF, and In-the-Wild datasets. The framework also generalized robustly to unseen diffusion- and flow-matching-based synthesis methods on the DFADD dataset. Finally, increasing backbone depth effectively mitigated the performance variance and inference instability observed in shallower models, improving detection robustness across diverse generative algorithms.
Approach

The XLSR-MamBo framework uses a pre-trained XLSR model for front-end feature extraction, followed by a hybrid Mamba-Attention backbone encoder. It systematically evaluates four topological designs (MamBo-1 to MamBo-4) that integrate State Space Model (SSM) variants (Mamba, Mamba2, Hydra, Gated DeltaNet) with Attention, and it studies the effect of stacking depth. Gated attention pooling then aggregates the frame-level features for binary spoof/bonafide classification.
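
A minimal PyTorch sketch of this pipeline follows. Everything in it is illustrative: the class names, the 144-dimensional hidden size, the gated causal convolution standing in for the SSM layer, and the tanh-scored pooling are assumptions rather than the paper's implementation. In the real system the input features come from a pre-trained XLSR model, and the SSM stand-in would be replaced by an actual Mamba/Mamba2/Hydra/Gated DeltaNet layer (e.g., from the mamba_ssm package).

```python
import torch
import torch.nn as nn


class SSMStandIn(nn.Module):
    """Placeholder for a Mamba/Mamba2/Hydra/Gated DeltaNet layer.

    A gated causal depthwise convolution roughly mimics the local
    temporal mixing of an SSM; it is NOT the paper's actual SSM.
    """
    def __init__(self, dim: int, kernel: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel - 1, groups=dim)
        self.gate = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                          # x: (batch, time, dim)
        h = self.norm(x).transpose(1, 2)           # (B, D, T)
        h = self.conv(h)[..., : x.size(1)]         # trim padding -> causal
        h = h.transpose(1, 2) * torch.sigmoid(self.gate(x))
        return x + h                               # residual connection


class HybridBlock(nn.Module):
    """One Mamba-Attention block: SSM for temporal compression,
    self-attention for global (content-based) artifact retrieval."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.ssm = SSMStandIn(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = self.ssm(x)
        h = self.norm(x)
        a, _ = self.attn(h, h, h)
        return x + a                               # residual connection


class GatedAttentionPooling(nn.Module):
    """Learns per-frame weights and pools frames into one utterance vector.
    A generic attentive pooling; the paper's gated variant may differ."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                                   nn.Linear(dim, 1))

    def forward(self, x):                          # (B, T, D) -> (B, D)
        w = torch.softmax(self.score(x), dim=1)    # weights over time
        return (w * x).sum(dim=1)


class XLSRMamBoSketch(nn.Module):
    def __init__(self, feat_dim: int = 1024, dim: int = 144, depth: int = 3):
        super().__init__()
        # In the real system the features come from a pre-trained XLSR
        # model; a linear projection stands in for that front-end here.
        self.proj = nn.Linear(feat_dim, dim)
        self.blocks = nn.Sequential(*[HybridBlock(dim) for _ in range(depth)])
        self.pool = GatedAttentionPooling(dim)
        self.head = nn.Linear(dim, 2)              # bonafide vs. spoof

    def forward(self, xlsr_feats):                 # (B, T, feat_dim)
        x = self.blocks(self.proj(xlsr_feats))
        return self.head(self.pool(x))


logits = XLSRMamBoSketch()(torch.randn(2, 201, 1024))  # ~4 s of XLSR frames
print(logits.shape)  # torch.Size([2, 2])
```

The depth argument corresponds to the stacking depth the paper scales (e.g., N3 in MamBo-3-Hydra-N3); the residual connections around both the SSM and Attention sub-layers keep deeper stacks trainable, which is consistent with the reported stabilizing effect of depth.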
Datasets

ASVspoof 2019 LA (training), ASVspoof 2021 LA, ASVspoof 2021 DF, In-the-Wild (ITW), DFADD

Model(s)

XLSR, Mamba, Mamba2, Hydra, Gated DeltaNet (GDN), MamBo (MamBo-1, MamBo-2, MamBo-3, MamBo-4) hybrid architectures

Author countries

China