Retrieval-Augmented Audio Deepfake Detection

Authors: Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang

Published: 2024-04-22 05:46:40+00:00

AI Summary

This paper proposes a Retrieval-Augmented Detection (RAD) framework for audio deepfake detection that augments test samples with similar retrieved samples to improve detection accuracy. Extended with a multi-fusion attentive classifier, the RAD framework achieves state-of-the-art results on the ASVspoof 2021 DF set and competitive results on the 2019 and 2021 LA sets.

Abstract

With recent advances in speech synthesis including text-to-speech (TTS) and voice conversion (VC) systems enabling the generation of ultra-realistic audio deepfakes, there is growing concern about their potential misuse. However, most deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a single model, resulting in performance bottlenecks and transparency issues. Inspired by retrieval-augmented generation (RAG), we propose a retrieval-augmented detection (RAD) framework that augments test samples with similar retrieved samples for enhanced detection. We also extend the multi-fusion attentive classifier to integrate it with our proposed RAD framework. Extensive experiments show the superior performance of the proposed RAD framework over baseline methods, achieving state-of-the-art results on the ASVspoof 2021 DF set and competitive results on the 2019 and 2021 LA sets. Further sample analysis indicates that the retriever consistently retrieves samples mostly from the same speaker with acoustic characteristics highly consistent with the query audio, thereby improving detection performance.


Key findings
The RAD framework significantly outperforms baseline methods, achieving state-of-the-art performance on the ASVspoof 2021 DF dataset and competitive results on the 2019 and 2021 LA datasets. Ablation studies confirm the importance of the RAD framework and the use of additional VCTK data for retrieval. Analysis suggests that the retriever prioritizes samples from the same speaker, improving detection accuracy.
Approach
The authors address the limitations of single-model audio deepfake detection by adding a retrieval module. At test time, this module retrieves bonafide audio samples similar to the input; the retrieved samples are fused with the test sample and fed into a multi-fusion attentive (MFA) classifier for enhanced detection, as sketched below.
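
To make the retrieval step concrete, here is a minimal sketch of the kind of pipeline the approach describes: embed audio with a frozen encoder, retrieve the nearest bonafide embeddings from a store, and hand the query plus its neighbours to a downstream classifier. The function names, cosine-similarity metric, k=5, and stacking-based fusion are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def retrieve_top_k(query_emb: np.ndarray, store_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k stored embeddings most similar to the query (cosine similarity)."""
    q = query_emb / np.linalg.norm(query_emb)
    s = store_embs / np.linalg.norm(store_embs, axis=1, keepdims=True)
    return np.argsort(s @ q)[::-1][:k]

# Toy demo with random vectors standing in for WavLM utterance embeddings
# (WavLM-Large outputs 1024-dimensional frame features; pooling is assumed).
rng = np.random.default_rng(0)
bonafide_store = rng.normal(size=(1000, 1024))  # embeddings of known-bonafide audio
query = rng.normal(size=1024)                   # embedding of the test utterance

idx = retrieve_top_k(query, bonafide_store, k=5)
# Fusion by stacking: the classifier would consume the query together with its neighbours.
fused = np.vstack([query[None, :], bonafide_store[idx]])  # shape (k + 1, 1024)
print(fused.shape)
```

In this sketch the fusion is simple stacking; the paper's MFA classifier performs attentive fusion over the query and retrieved samples, which this toy example does not reproduce.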
Datasets
ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF, VCTK
Model(s)
WavLM (for feature extraction), Multi-Fusion Attentive (MFA) classifier (extended with RAD)
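
For reference, a minimal sketch of utterance-level feature extraction with WavLM via the HuggingFace `transformers` library. The `microsoft/wavlm-large` checkpoint and mean pooling over time are assumptions for illustration; the paper's exact extraction setup may differ.

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

# Load a pretrained WavLM encoder (checkpoint choice is an assumption).
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-large")
model = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

waveform = torch.randn(16000)  # 1 s of dummy 16 kHz audio in place of a real file
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, 1024)
embedding = hidden.mean(dim=1).squeeze(0)       # (1024,) utterance-level embedding
```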
Author countries
China