CodecFake+: A Large-Scale Neural Audio Codec-Based Deepfake Speech Dataset

Authors: Xuanjun Chen, Jiawei Du, Haibin Wu, Lin Zhang, I-Ming Lin, I-Hsiang Chiu, Wenze Ren, Yuan Tseng, Yu Tsao, Jyh-Shing Roger Jang, Hung-yi Lee

Published: 2025-01-14 16:26:14+00:00

AI Summary

This paper introduces CodecFake+, a large-scale dataset for detecting deepfake speech generated by codec-based speech generation (CoSG) systems. It also proposes a taxonomy for categorizing neural audio codecs, enabling detailed analysis of factors influencing CodecFake detection performance.

Abstract

With the rapid advancement of neural audio codecs, codec-based speech generation (CoSG) systems have become highly powerful. Unfortunately, CoSG also enables the creation of highly realistic deepfake speech, making it easier to mimic an individual's voice and spread misinformation. We refer to this emerging deepfake speech generated by CoSG systems as CodecFake. Detecting such CodecFake is an urgent challenge, yet most existing systems primarily focus on detecting fake speech generated by traditional speech synthesis models. In this paper, we introduce CodecFake+, a large-scale dataset designed to advance CodecFake detection. To our knowledge, CodecFake+ is the largest dataset encompassing the most diverse range of codec architectures. The training set is generated through re-synthesis using 31 publicly available open-source codec models, while the evaluation set includes web-sourced data from 17 advanced CoSG models. We also propose a comprehensive taxonomy that categorizes codecs by their root components: vector quantizer, auxiliary objectives, and decoder types. Our proposed dataset and taxonomy enable detailed analysis at multiple levels to discern the key factors for successful CodecFake detection. At the individual codec level, we validate the effectiveness of using codec re-synthesized speech (CoRS) as training data for large-scale CodecFake detection. At the taxonomy level, we show that detection performance is strongest when the re-synthesis model incorporates disentanglement auxiliary objectives or a frequency-domain decoder. Furthermore, from the perspective of using all the CoRS training data, we show that our proposed taxonomy can be used to select better training data for improving detection performance. Overall, we envision that CodecFake+ will be a valuable resource for both general and fine-grained exploration to develop better anti-spoofing models against CodecFake.


Key findings
Using codec re-synthesized speech (CoRS) as training data is effective for CodecFake detection. Detection performance is strongest when the re-synthesis model incorporates disentanglement auxiliary objectives or a frequency-domain decoder. The proposed taxonomy helps select better training data for improved detection performance.
Approach
The authors create CodecFake+, a dataset with training data generated by re-synthesizing speech using 31 open-source codecs and evaluation data from 17 CoSG models. They also propose a taxonomy to categorize codecs based on vector quantizers, auxiliary objectives, and decoder types, allowing for multi-level analysis of CodecFake detection.
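The CoRS training recipe described above can be sketched in a few lines: original utterances are labeled bonafide, each codec re-synthesized copy is labeled spoof, and every spoof carries a taxonomy tag (vector quantizer, auxiliary objective, decoder type) so training subsets can later be selected by category. This is an illustrative assumption of how such a pipeline might be organized, not the paper's actual implementation; the codec names, field values, and `build_cors_pairs` helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CodecTaxonomy:
    """Hypothetical tag following the paper's three root components."""
    vector_quantizer: str     # e.g. "RVQ", "single-VQ"
    auxiliary_objective: str  # e.g. "disentanglement", "none"
    decoder: str              # e.g. "time-domain", "frequency-domain"

def build_cors_pairs(utterances, codecs):
    """Build (audio_id, label, taxonomy) pairs for CodecFake detection.

    Each original utterance is labeled bonafide; each codec
    re-synthesized version (CoRS) is labeled spoof and tagged with
    its codec's taxonomy so subsets can be filtered by category.
    """
    pairs = []
    for utt in utterances:
        pairs.append((utt, "bonafide", None))
        for name, taxonomy in codecs.items():
            # A real pipeline would run the codec's encode/decode here;
            # we only record the identifier of the re-synthesized file.
            pairs.append((f"{utt}.{name}", "spoof", taxonomy))
    return pairs

# Two made-up codec entries standing in for the 31 open-source codecs.
codecs = {
    "codec_a": CodecTaxonomy("RVQ", "none", "time-domain"),
    "codec_b": CodecTaxonomy("RVQ", "disentanglement", "frequency-domain"),
}
pairs = build_cors_pairs(["p225_001"], codecs)
# One bonafide entry plus one spoof entry per codec per utterance.
```

Selecting training data by taxonomy (e.g., keeping only spoofs whose codec uses a frequency-domain decoder) then reduces to filtering `pairs` on the tag, which mirrors how the paper uses the taxonomy to choose better training subsets.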
Datasets
CodecFake+ (training data generated from 31 codecs and VCTK corpus, evaluation data from 17 CoSG models and web-sourced data), VCTK, ASVspoof2019, E2 TTS test set, MaskGCT-VCTK
Model(s)
W2V2-AASIST (with RawBoost data augmentation)
Author countries
Taiwan, Czech Republic