The Codecfake Dataset and Countermeasures for the Universal Detection of Deepfake Audio

Authors: Yuankun Xie, Yi Lu, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Jianhua Tao, Xin Qi, Xiaopeng Wang, Yukun Liu, Haonan Cheng, Long Ye, Yi Sun

Published: 2024-05-08 08:28:40+00:00

AI Summary

This paper introduces Codecfake, a large-scale dataset of over 1 million audio samples for detecting ALM-based deepfake audio generated using neural codecs. To improve generalization, the authors propose CSAM, a co-training sharpness-aware minimization strategy that addresses domain ascent bias, achieving a low average equal error rate (EER) of 0.616%.

Abstract

With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for generalized detection methods. ALM-based deepfake audio is currently widespread, highly deceptive, and versatile in type, posing a significant challenge to audio deepfake detection (ADD) models trained solely on vocoded data. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method: the conversion from neural codec to waveform. We first constructed the Codecfake dataset, an open-source, large-scale collection comprising over 1 million audio samples in both English and Chinese, focused on ALM-based audio detection. As a countermeasure, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original sharpness-aware minimization (SAM), we propose the CSAM strategy to learn a domain-balanced and generalized minimum. In our experiments, we first demonstrate that an ADD model trained with the Codecfake dataset can effectively detect ALM-based audio. Furthermore, our proposed generalization countermeasure yields the lowest average equal error rate (EER) of 0.616% across all test conditions compared to baseline models. The dataset and associated code are available online.


Key findings
Models trained solely on vocoder-based data perform poorly on Codecfake. The Codecfake dataset and CSAM strategy significantly improve deepfake audio detection, achieving a 0.616% average EER across diverse test conditions. The CSAM method effectively mitigates domain ascent bias in co-training scenarios.
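The equal error rate (EER) quoted above is the operating point where a detector's false-acceptance rate on spoofed audio equals its false-rejection rate on bona fide audio. The paper does not publish its scoring code, but a minimal sketch of the metric (assuming higher scores indicate bona fide audio, with synthetic Gaussian scores standing in for real detector outputs) looks like this:

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate: the threshold where the false-acceptance
    rate on spoofs equals the false-rejection rate on bona fide
    audio (convention: higher score = more likely bona fide)."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # FAR: fraction of spoofs scoring at or above the threshold.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    # FRR: fraction of bona fide samples scoring below the threshold.
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

# Illustrative scores only (not from the paper's models).
rng = np.random.default_rng(0)
bona = rng.normal(1.0, 0.5, 1000)    # hypothetical bona fide scores
spoof = rng.normal(-1.0, 0.5, 1000)  # hypothetical spoof scores
eer = compute_eer(bona, spoof)
```

With well-separated score distributions like these, the EER falls in the low single-digit percent range; a real system reaching 0.616% average EER separates the classes far more cleanly across all test conditions.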
Approach
The authors created the Codecfake dataset by using seven neural codec models to generate deepfake audio from real audio datasets. They also proposed CSAM, a co-training sharpness-aware minimization strategy that improves the generalization of audio deepfake detection models by addressing domain ascent bias during the co-training process.
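CSAM builds on sharpness-aware minimization, which takes a gradient ascent step to the worst-case weights within a small neighborhood and then descends using the gradient computed there, steering training toward flat minima. The paper's CSAM additionally balances this ascent across training domains; the snippet below sketches only the underlying plain SAM update on a toy quadratic loss (the learning rate, radius `rho`, and loss are illustrative, not the paper's settings):

```python
import numpy as np

def sam_step(w, loss_grad, lr=0.1, rho=0.05):
    """One plain SAM update: ascend to the worst-case weights in an
    L2 ball of radius rho, then descend with the gradient found there."""
    g = loss_grad(w)
    # Ascent step: perturb weights in the direction of increasing loss.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Descent step: use the gradient evaluated at the perturbed point.
    g_sharp = loss_grad(w + eps)
    return w - lr * g_sharp

# Toy loss L(w) = 0.5 * ||w||^2, so grad L(w) = w (illustration only).
grad = lambda w: w
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sam_step(w, grad)
# w has converged to a small neighborhood of the minimum at the origin.
```

The "domain ascent bias" the paper targets arises when the single ascent step in co-training is dominated by one domain's loss; CSAM's contribution is to keep that perturbation balanced across the vocoded and codec-based training domains.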
Datasets
Codecfake (1 million+ samples in English and Chinese, generated using 7 neural codec models from VCTK and AISHELL-3), ASVspoof2019LA, ITW (In-The-Wild), ADD2023T1.2, LibriTTS, Audiocaps
Model(s)
Mel-LCNN, W2V2-LCNN, WavLM-AASIST, W2V2-AASIST (with SAM, ASAM, and CSAM variations)
Author countries
China