Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy

Authors: Xuanjun Chen, I-Ming Lin, Lin Zhang, Jiawei Du, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

Published: 2025-05-19 11:31:32+00:00

Comment: Accepted by Interspeech 2025; Update table 3/4

AI Summary

This paper introduces a novel method for tracing the source of codec-based audio deepfakes (CodecFakes) using a taxonomy of neural audio codecs. It defines three multi-class classification tasks, covering vector quantization, auxiliary objectives, and decoder types, and integrates them into a multi-task training framework. Experimental results on the CodecFake+ dataset demonstrate the feasibility of this source tracing approach while also highlighting challenges, particularly with out-of-domain deepfakes.

Abstract

Recent advances in neural audio codec-based speech generation (CoSG) models have produced remarkably realistic audio deepfakes. We refer to deepfake speech generated by CoSG systems as codec-based deepfake, or CodecFake. Although existing anti-spoofing research on CodecFake predominantly focuses on verifying the authenticity of audio samples, almost no attention has been given to tracing the CoSG used in generating these deepfakes. In CodecFake generation, processes such as speech-to-unit encoding, discrete unit modeling, and unit-to-speech decoding are fundamentally based on neural audio codecs. Motivated by this, we introduce source tracing for CodecFake via neural audio codec taxonomy, which dissects neural audio codecs to trace CoSG. Our experimental results on the CodecFake+ dataset provide promising initial evidence for the feasibility of CodecFake source tracing while also highlighting several challenges that warrant further investigation.


Key findings

The study provides promising initial evidence for the feasibility of CodecFake source tracing, achieving F1 scores of 96%-97% on in-domain data. Performance notably declines for out-of-domain CodecFake samples, indicating challenges with generative modeling and unseen codecs. Balancing training data according to auxiliary objectives was found to yield stronger generalization across source tracing tasks.
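
The balancing idea mentioned above can be illustrated with a small sketch. The summary does not say how the training data was balanced, so the downsampling strategy, the `aux` field, the class names, and the function `balance_by_label` are all assumptions made for illustration, not the authors' procedure.

```python
import random
from collections import Counter, defaultdict

def balance_by_label(samples, key, seed=0):
    """Downsample every class to the size of the smallest class (assumed strategy)."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)
    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(groups):
        # Draw the same number of items from each class, without replacement.
        balanced.extend(rng.sample(groups[label], smallest))
    rng.shuffle(balanced)
    return balanced

# Toy metadata: "aux_A", "aux_B", "aux_C" are placeholder auxiliary-objective classes.
samples = (
    [{"path": f"a{i}.wav", "aux": "aux_A"} for i in range(6)]
    + [{"path": f"b{i}.wav", "aux": "aux_B"} for i in range(4)]
    + [{"path": f"c{i}.wav", "aux": "aux_C"} for i in range(2)]
)
balanced = balance_by_label(samples, key="aux")
counts = Counter(sample["aux"] for sample in balanced)
```

After the call, `counts` holds the same number of examples for each placeholder class, so no single auxiliary-objective category dominates training.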

Approach

The authors propose source tracing for CodecFake via neural audio codec taxonomy, defining three multi-class classification tasks: Vector Quantization, Auxiliary Objective, and Decoder Type classification. These tasks are trained in a multi-task framework alongside binary spoof detection: all tasks share a Wav2Vec2-AASIST frontend, and each task has its own independent classification backend.
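
The shared-frontend, independent-backend layout can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class counts in `TASKS`, the embedding size, the linear heads, and the equally weighted sum of cross-entropy losses are all hypothetical, and a random vector stands in for the Wav2Vec2-AASIST frontend output.

```python
import math
import random

# Hypothetical class counts per task; the summary does not list them.
TASKS = {"spoof": 2, "vq": 3, "aux": 3, "decoder": 2}
EMB_DIM = 8  # assumed size of the shared frontend embedding

def make_head(num_classes, dim, rng):
    """One independent linear backend: a num_classes x dim weight matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_classes)]

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, using the log-sum-exp trick."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

def multitask_loss(embedding, heads, labels, weights=None):
    """Weighted sum of per-task losses, all computed from one shared embedding."""
    total = 0.0
    per_task = {}
    for task, head in heads.items():
        logits = [sum(w * x for w, x in zip(row, embedding)) for row in head]
        per_task[task] = cross_entropy(logits, labels[task])
        total += (weights or {}).get(task, 1.0) * per_task[task]
    return total, per_task

rng = random.Random(0)
heads = {task: make_head(n, EMB_DIM, rng) for task, n in TASKS.items()}
embedding = [rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)]  # stand-in for frontend output
labels = {"spoof": 1, "vq": 0, "aux": 2, "decoder": 1}  # toy targets for one utterance
total, per_task = multitask_loss(embedding, heads, labels)
```

Because every head reads the same embedding, a gradient step on `total` would update the shared frontend with signal from all four tasks while each backend stays task-specific. The optional `weights` argument shows where per-task loss weighting could go if equal weighting is not appropriate.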

Datasets

CodecFake+ dataset (comprising CoRS and CoSG subsets), VCTK

Model(s)

Wav2Vec2-AASIST (backbone), pre-trained Wav2Vec 2.0 Encoder, RawNet2 Encoder

Author countries

Taiwan, Czech Republic