RPRA-ADD: Forgery Trace Enhancement-Driven Audio Deepfake Detection

Authors: Ruibo Fu, Xiaopeng Wang, Zhengqi Wen, Jianhua Tao, Yuankun Xie, Zhiyong Wang, Chunyu Qiang, Xuefei Liu, Cunhang Fan, Chenxing Li, Guanjun Li

Published: 2025-05-31 04:03:38+00:00

AI Summary

This paper introduces RPRA-ADD, a robust audio deepfake detection framework that enhances forgery traces using Reconstruction-Perception-Reinforcement-Attention networks. It improves upon existing methods by focusing on learning intrinsic differences between real and fake audio, leading to state-of-the-art performance on multiple benchmark datasets and strong cross-domain generalization.

Abstract

Existing methods for deepfake audio detection have demonstrated some effectiveness, but they still struggle to generalize to new forgery techniques and evolving attack patterns. This limitation mainly arises because the models rely heavily on the distribution of the training data and fail to learn a decision boundary that captures the essential characteristics of forgeries. Additionally, relying solely on a classification loss makes it difficult to capture the intrinsic differences between real and fake audio. In this paper, we propose RPRA-ADD, an integrated Reconstruction-Perception-Reinforcement-Attention network-based, forgery-trace-enhancement-driven robust audio deepfake detection framework. First, we propose a Global-Local Forgery Perception (GLFP) module to enhance the acoustic perception of forgery traces. To reinforce the distributional differences between real and fake audio in feature space, we design the Multi-stage Dispersed Enhancement Loss (MDEL), which applies a dispersal strategy across multi-stage feature spaces. Furthermore, to sharpen feature awareness of forgery traces, the Fake Trace Focused Attention (FTFA) mechanism is introduced to adjust attention weights dynamically according to the reconstruction discrepancy matrix. Visualization experiments demonstrate not only that FTFA improves attention to voice segments but also that it enhances generalization capability. Experimental results show that the proposed method achieves state-of-the-art performance on four benchmark datasets (ASVspoof2019, ASVspoof2021, CodecFake, and FakeSound), with over 20% performance improvement. In addition, it outperforms existing methods in rigorous 3×3 cross-domain evaluations across speech, sound, and singing, demonstrating strong generalization across diverse audio domains.
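
To make the FTFA idea above concrete, the following is a minimal PyTorch-style sketch of attention modulated by a reconstruction discrepancy matrix. The module name, tensor shapes, and normalization are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class FakeTraceFocusedAttention(nn.Module):
    """Sketch: re-weight self-attention output by per-position reconstruction error."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats, original, reconstructed):
        # feats:         (B, T, D) encoder features
        # original:      (B, T, D) input spectrogram patches
        # reconstructed: (B, T, D) decoder reconstruction
        # Per-position discrepancy, normalized to [0, 1]; large values mark
        # regions the reconstruction fits poorly, i.e. likely forgery traces.
        disc = (original - reconstructed).abs().mean(dim=-1, keepdim=True)
        disc = disc / (disc.amax(dim=1, keepdim=True) + 1e-8)
        attended, _ = self.attn(feats, feats, feats)
        # Emphasize attended features where the discrepancy is large.
        return feats + disc * attended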


Key findings
RPRA-ADD achieves state-of-the-art performance on four benchmark datasets (ASVspoof2019, ASVspoof2021, CodecFake, and FakeSound), with over 20% performance improvement. It also outperforms existing methods in 3×3 cross-domain evaluations across speech, sound, and singing, demonstrating strong generalization. Visualization experiments confirm that the FTFA mechanism focuses attention on the relevant voice segments.
Approach
RPRA-ADD uses an encoder-decoder architecture with a novel Global-Local Forgery Perception (GLFP) module to enhance forgery trace perception, a Multi-stage Dispersed Enhancement Loss (MDEL) to reinforce feature space differences, and a Fake Trace Focused Attention (FTFA) mechanism to highlight forgery traces. The enhanced features are then classified using AASIST.
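
Below is a hedged sketch of the MDEL idea: at several intermediate feature stages, real and fake embeddings are pushed apart. The hinge-on-centroid-distance formulation and the margin value are assumptions for illustration, not the paper's exact loss.

import torch
import torch.nn.functional as F

def mdel_loss(stage_feats, labels, margin: float = 1.0):
    # stage_feats: list of (B, D) pooled features, one tensor per network stage.
    # labels:      (B,) with 1 = bona fide, 0 = fake.
    losses = []
    for feats in stage_feats:
        feats = F.normalize(feats, dim=-1)
        real, fake = feats[labels == 1], feats[labels == 0]
        if real.numel() == 0 or fake.numel() == 0:
            continue  # skip stages where the mini-batch lacks one class
        # Encourage the real and fake centroids to stay at least `margin` apart.
        dist = torch.norm(real.mean(dim=0) - fake.mean(dim=0))
        losses.append(F.relu(margin - dist))
    return torch.stack(losses).mean() if losses else torch.zeros(())

In training, a term like this would be added to the standard classification loss so that the dispersal pressure shapes the intermediate feature spaces as well as the final decision boundary.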
Datasets
ASVspoof2019, ASVspoof2021, CodecFake, FakeSound, SingFake
Model(s)
AudioMAE (encoder-decoder), AASIST (classifier)
Author countries
China