Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy

Authors: Xuanjun Chen, I-Ming Lin, Lin Zhang, Jiawei Du, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

Published: 2025-05-19 11:31:32+00:00

AI Summary

This paper introduces a novel approach for tracing the source of codec-based audio deepfakes (CodecFake) by analyzing their underlying neural audio codecs. The approach leverages a neural audio codec taxonomy to identify characteristic features of the codecs used to generate the deepfakes, enabling source tracing even for unseen codec-based speech generation (CoSG) systems.

Abstract

Recent advances in neural audio codec-based speech generation (CoSG) models have produced remarkably realistic audio deepfakes. We refer to deepfake speech generated by CoSG systems as codec-based deepfake, or CodecFake. Although existing anti-spoofing research on CodecFake predominantly focuses on verifying the authenticity of audio samples, little attention has been given to tracing the CoSG system used to generate these deepfakes. In CodecFake generation, processes such as speech-to-unit encoding, discrete unit modeling, and unit-to-speech decoding are fundamentally based on neural audio codecs. Motivated by this, we introduce source tracing for CodecFake via a neural audio codec taxonomy, which dissects neural audio codecs to trace CoSG systems. Our experimental results on the CodecFake+ dataset provide promising initial evidence for the feasibility of CodecFake source tracing while also highlighting several challenges that warrant further investigation.


Key findings
The multi-task learning approach achieves high accuracy in identifying the codec components, demonstrating the feasibility of CodecFake source tracing. However, performance significantly degrades when encountering unseen codecs or CoSG systems, highlighting the challenge of generalization in this task. Balancing training data based on auxiliary objectives improves generalization.
Approach
The authors propose a multi-task learning framework that classifies CodecFake audio based on its vector quantization, auxiliary objectives, and decoder type, as defined by a neural audio codec taxonomy. This framework is trained on the CodecFake+ dataset and uses Wav2Vec2-AASIST as the backbone model.
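The multi-task setup described above can be sketched as a shared backbone embedding feeding one classification head per taxonomy axis, with the losses summed. The sketch below is illustrative only: the head names, label-space sizes, and embedding dimension are assumptions, not the paper's actual configuration, and a simple NumPy linear head stands in for the Wav2Vec2-AASIST backbone and trained classifiers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical taxonomy label spaces (sizes are placeholders, not the paper's).
N_VQ, N_AUX, N_DEC = 4, 3, 2   # vector-quantization types, auxiliary objectives, decoder types
EMB_DIM = 16                   # placeholder for the backbone embedding size

# One linear head per taxonomy axis, all sharing the same embedding.
heads = {
    "vq": rng.standard_normal((EMB_DIM, N_VQ)),
    "aux": rng.standard_normal((EMB_DIM, N_AUX)),
    "dec": rng.standard_normal((EMB_DIM, N_DEC)),
}

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multitask_loss(emb, labels):
    """Sum of per-head cross-entropies; `labels` maps head name -> class index."""
    total = 0.0
    for name, W in heads.items():
        probs = softmax(emb @ W)
        total += -np.log(probs[labels[name]] + 1e-12)
    return total

# A random vector stands in for a real utterance embedding.
emb = rng.standard_normal(EMB_DIM)
loss = multitask_loss(emb, {"vq": 1, "aux": 0, "dec": 1})
```

In a trained system, minimizing the summed loss pushes the shared embedding to encode all three codec attributes at once, which is what lets the classifier characterize an unseen CoSG system by its codec components rather than by system identity.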
Datasets
CodecFake+ dataset (CoRS and CoSG subsets)
Model(s)
Modified Wav2Vec2-AASIST with multi-task learning
Author countries
Taiwan, Czech Republic