XLSR-Mamba: A Dual-Column Bidirectional State Space Model for Spoofing Attack Detection

Authors: Yang Xiao, Rohan Kumar Das

Published: 2024-11-15 08:13:51+00:00

AI Summary

This paper proposes XLSR-Mamba, a dual-column bidirectional state space model for spoofing attack detection. It combines a pre-trained wav2vec 2.0 (XLSR) front-end with a dual-column bidirectional Mamba classifier, achieving competitive results and faster inference than transformer-based models.

Abstract

Transformers and their variants have achieved great success in speech processing. However, their multi-head self-attention mechanism is computationally expensive. Therefore, a novel selective state space model, Mamba, has been proposed as an alternative. Building on its success in automatic speech recognition, we apply Mamba to spoofing attack detection. Mamba is well-suited for this task, as it can capture the artifacts in spoofed speech signals by handling long sequences. However, Mamba's performance may suffer when it is trained with limited labeled data. To mitigate this, we propose combining a new dual-column Mamba structure with self-supervised learning, using the pre-trained wav2vec 2.0 model. The experiments show that our proposed approach achieves competitive results and faster inference on the ASVspoof 2021 LA and DF datasets, and on the more challenging In-the-Wild dataset, it emerges as the strongest candidate for spoofing attack detection. The code has been publicly released at https://github.com/swagshaw/XLSR-Mamba.


Key findings
XLSR-Mamba achieves results competitive with state-of-the-art models on the ASVspoof 2021 LA and DF datasets and is the strongest performer on the more challenging In-the-Wild dataset. It also delivers faster inference than transformer-based models, making it suitable for real-time applications. The dual-column architecture proves particularly effective at capturing both fine-grained and broader patterns in spoofed speech.
Approach
The authors address the computational expense of transformer self-attention with a dual-column bidirectional Mamba (DuaBiMamba) architecture. One column processes the feature sequence forward in time and the other backward; their outputs are concatenated to capture both local and global artifacts. A pre-trained self-supervised wav2vec 2.0 (XLSR) model serves as the front-end feature extractor, mitigating the impact of limited labeled data. A minimal sketch of the dual-column block follows.
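To make the dual-column idea concrete, the sketch below shows one bidirectional block in PyTorch. It is not the authors' implementation (see the linked repository for that); it assumes the mamba_ssm package's Mamba block, which operates on (batch, seq_len, d_model) tensors and typically requires a CUDA build. The class name DuaBiMambaBlock and all hyperparameter values are illustrative.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (CUDA required)


class DuaBiMambaBlock(nn.Module):
    """Illustrative dual-column bidirectional Mamba block: one column
    scans the feature sequence forward in time, the other scans a
    time-reversed copy; the two outputs are concatenated and projected
    back to d_model."""

    def __init__(self, d_model: int = 144):
        super().__init__()
        self.fwd = Mamba(d_model=d_model)  # forward-time column
        self.bwd = Mamba(d_model=d_model)  # backward-time column
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), e.g. projected XLSR features
        y_fwd = self.fwd(x)
        # flip along the time axis, scan, then flip back into order
        y_bwd = torch.flip(self.bwd(torch.flip(x, dims=[1])), dims=[1])
        return self.proj(torch.cat([y_fwd, y_bwd], dim=-1))


# Usage sketch: frame-level features from a pre-trained XLSR
# (wav2vec 2.0) encoder would be projected to d_model, passed through
# a stack of such blocks, then pooled into a spoof/bona fide score.
block = DuaBiMambaBlock(d_model=144)
feats = torch.randn(2, 200, 144)  # dummy batch of feature sequences
out = block(feats)                # shape: (2, 200, 144)
```

Concatenating the two columns rather than summing them lets the downstream classifier weigh forward and backward context independently, which is one plausible reading of why the dual-column design captures both fine-grained and broader patterns.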
Datasets
ASVspoof 2021 LA, ASVspoof 2021 DF, In-the-Wild
Model(s)
Dual-Column Bidirectional Mamba (DuaBiMamba), wav2vec 2.0 (XLSR)
Author countries
Singapore