RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection

Authors: Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Lv Zhao, Cunhang Fan

Published: 2024-06-10 08:13:42+00:00

AI Summary

This paper introduces RawBMamba, a bidirectional end-to-end state space model for audio deepfake detection. It combines short-range features extracted by a sinc layer and convolutional layers with long-range features captured by a bidirectional Mamba model, addressing the unidirectional modelling limitation of the standard Mamba model. The paper reports a 34.1% improvement over Rawformer on the ASVspoof2021 LA dataset and competitive performance on the other evaluated datasets.

Abstract

Fake artefacts for discriminating between bonafide and fake audio can exist in both short- and long-range segments. Therefore, combining local and global feature information can effectively discriminate between bonafide and fake audio. This paper proposes an end-to-end bidirectional state space model, named RawBMamba, to capture both short- and long-range discriminative information for audio deepfake detection. Specifically, we use sinc Layer and multiple convolutional layers to capture short-range features, and then design a bidirectional Mamba to address Mamba's unidirectional modelling problem and further capture long-range feature information. Moreover, we develop a bidirectional fusion module to integrate embeddings, enhancing audio context representation and combining short- and long-range information. The results show that our proposed RawBMamba achieves a 34.1% improvement over Rawformer on ASVspoof2021 LA dataset, and demonstrates competitive performance on other datasets.

Key findings

RawBMamba achieved a 34.1% improvement over Rawformer on the ASVspoof2021 LA dataset. It demonstrated competitive performance on the other evaluated datasets. Visualization analysis suggests that Mamba yields more discriminative features than Transformers for this task.

Approach

RawBMamba uses a sinc layer and multiple convolutional layers to extract short-range audio features. A bidirectional Mamba model then captures long-range features, addressing the unidirectional limitation of the standard Mamba model. A bidirectional fusion module integrates the resulting embeddings, combining short- and long-range information for deepfake detection.
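
The idea of bidirectional sequence modelling described above can be illustrated with a small NumPy sketch. It is not the paper's architecture: a real Mamba block uses input-dependent (selective) parameters, whereas this toy uses a fixed scalar recurrence, and fusing the two directions by concatenation is an assumption made here because this summary does not describe the fusion module. The names `linear_scan` and `bidirectional_scan` are hypothetical.

```python
import numpy as np

def linear_scan(x, a, b):
    """Causal linear recurrence h_t = a * h_{t-1} + b * x_t.

    x has shape (T, D). The output at step t depends only on
    inputs at steps <= t, i.e. the scan is unidirectional.
    """
    h = np.zeros(x.shape[1])
    out = np.empty(x.shape, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out

def bidirectional_scan(x, a, b):
    """Run the scan in both time directions and concatenate.

    The backward branch reverses the sequence, scans it, and
    reverses the result so both branches are aligned in time.
    Concatenation is an illustrative fusion choice (assumption).
    """
    fwd = linear_scan(x, a, b)
    bwd = linear_scan(x[::-1], a, b)[::-1]
    return np.concatenate([fwd, bwd], axis=-1)

# A single impulse at t = 1 in a length-4, one-channel sequence.
x = np.array([[0.0], [1.0], [0.0], [0.0]])
y = bidirectional_scan(x, a=0.5, b=1.0)
```

In this example the forward channel `y[:, 0]` is zero before the impulse and decays after it, while the backward channel `y[:, 1]` is non-zero before the impulse and zero after it. Each position therefore carries context from both sides, which a single causal scan cannot provide.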

Datasets

ASVspoof2019 LA, ASVspoof2021 LA, ASVspoof2021 DF

Model(s)

Bidirectional Mamba state space model, SincNet, ResNet blocks with squeeze-and-excitation operations.
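
For readers unfamiliar with sinc layers, the following sketch shows the kind of band-pass kernel a SincNet-style front end parameterises: the difference of two ideal low-pass sinc filters, smoothed with a Hamming window. In SincNet the two cutoff frequencies are learned per filter; here they are fixed inputs, and the function name `sinc_bandpass`, the kernel size, and the 16 kHz sample rate are illustrative assumptions rather than settings taken from the paper.

```python
import numpy as np

def sinc_bandpass(f_low_hz, f_high_hz, kernel_size=129, sample_rate=16000):
    """Windowed-sinc band-pass FIR kernel.

    Built as (low-pass at f_high) minus (low-pass at f_low), so only
    the two cutoff frequencies define the filter shape.
    """
    # Sample indices centred on zero so the kernel is symmetric.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Cutoffs in cycles per sample.
    f1 = f_low_hz / sample_rate
    f2 = f_high_hz / sample_rate
    # np.sinc(x) is the normalised sinc, sin(pi * x) / (pi * x).
    h = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # A Hamming window reduces ripple caused by truncating the ideal filter.
    return h * np.hamming(kernel_size)

# Kernel passing roughly 1-2 kHz; zero-padding to 16000 points
# gives a magnitude response with one FFT bin per Hz.
h = sinc_bandpass(1000, 2000)
H = np.abs(np.fft.rfft(h, n=16000))
```

Comparing `H[1500]` (inside the band) with `H[4000]` (outside it) shows the band-pass behaviour. Convolving raw waveform frames with a bank of such kernels is what lets the model operate end-to-end on audio samples instead of hand-crafted spectral features.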

Author countries

China
