Towards Generalized Source Tracing for Codec-Based Deepfake Speech

Authors: Xuanjun Chen, I-Ming Lin, Lin Zhang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

Published: 2025-06-08

AI Summary

This paper addresses the suboptimal performance of source tracing models for codec-based deepfake speech. It introduces SASTNet (Semantic-Acoustic Source Tracing Network), which jointly leverages semantic and acoustic features to improve generalization and achieves state-of-the-art performance on the CoSG test set of the CodecFake+ dataset.

Abstract

Recent attempts at source tracing for codec-based deepfake speech (CodecFake), generated by neural audio codec-based speech generation (CoSG) models, have exhibited suboptimal performance. In particular, how to train source tracing models using simulated CoSG data while maintaining strong performance on real CoSG-generated audio remains an open challenge. In this paper, we show that models trained solely on codec-resynthesized data tend to overfit to non-speech regions and struggle to generalize to unseen content. To mitigate these challenges, we introduce the Semantic-Acoustic Source Tracing Network (SASTNet), which jointly leverages Whisper for semantic feature encoding and Wav2vec2 with AudioMAE for acoustic feature encoding. Our proposed SASTNet achieves state-of-the-art performance on the CoSG test set of the CodecFake+ dataset, demonstrating its effectiveness for reliable source tracing.
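
To make the two feature streams concrete, below is a minimal sketch of extracting semantic features from a Whisper encoder and acoustic features from Wav2vec2 using Hugging Face `transformers`. The checkpoints, preprocessing, and printed shapes are illustrative assumptions rather than the paper's exact configuration, and AudioMAE (the paper's second acoustic encoder) is omitted for brevity.

```python
# Minimal sketch of the two feature streams, assuming Hugging Face
# `transformers` checkpoints as stand-ins for the paper's encoders.
# AudioMAE, the second acoustic encoder, is omitted here for brevity.
import torch
from transformers import (
    WhisperFeatureExtractor, WhisperModel,
    Wav2Vec2FeatureExtractor, Wav2Vec2Model,
)

waveform = torch.randn(16000 * 4)  # 4 s of 16 kHz audio (dummy input)

# Semantic stream: Whisper encoder hidden states.
whisper_fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
whisper = WhisperModel.from_pretrained("openai/whisper-base").eval()
mel = whisper_fe(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    semantic = whisper.encoder(mel.input_features).last_hidden_state
print(semantic.shape)  # (1, 1500, 512) for whisper-base

# Acoustic stream: Wav2vec2 hidden states.
w2v_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()
inputs = w2v_fe(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    acoustic = w2v(inputs.input_values).last_hidden_state
print(acoustic.shape)  # (1, ~199, 768) for 4 s of audio
```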


Key findings
SASTNet achieves state-of-the-art performance on the CodecFake+ CoSG test set. The model demonstrates improved robustness to unseen content and silence compared to baseline models. Joint semantic and acoustic feature encoding is crucial for effective source tracing.
Approach
SASTNet uses Whisper for semantic feature encoding and Wav2vec2 with AudioMAE for acoustic feature encoding. A cross-modal transformer fuses these features, and an AASIST classifier predicts the codec source. This approach addresses overfitting to non-speech regions and improves generalization.
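
The fusion and classification stages can be sketched as follows. This is a hedged approximation, not the paper's implementation: the projection sizes, head count, mean pooling, and the number of codec classes (`n_codecs`) are placeholders, and the AASIST classifier is simplified to a linear head over the fused representation.

```python
# Hedged sketch of cross-modal fusion: acoustic frames (queries) attend
# to semantic frames (keys/values), and a linear head stands in for the
# paper's AASIST classifier. All dimensions are illustrative placeholders.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, acoustic_dim=768, semantic_dim=512,
                 d_model=256, n_heads=4, n_codecs=8):
        super().__init__()
        self.proj_a = nn.Linear(acoustic_dim, d_model)  # acoustic -> shared space
        self.proj_s = nn.Linear(semantic_dim, d_model)  # semantic -> shared space
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        self.head = nn.Linear(d_model, n_codecs)        # codec-source logits

    def forward(self, acoustic, semantic):
        q = self.proj_a(acoustic)                       # (B, T_a, d_model)
        kv = self.proj_s(semantic)                      # (B, T_s, d_model)
        fused, _ = self.cross_attn(q, kv, kv)           # acoustic attends to semantics
        pooled = self.ffn(fused).mean(dim=1)            # temporal mean pooling
        return self.head(pooled)                        # (B, n_codecs)

model = CrossModalFusion()
logits = model(torch.randn(2, 199, 768), torch.randn(2, 1500, 512))
print(logits.shape)  # torch.Size([2, 8])
```

Cross-attention from the acoustic stream to the semantic stream is one plausible reading of "a cross-modal transformer fuses these features"; the paper may fuse in the other direction or symmetrically.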
Datasets
CodecFake+ dataset (CoRS and CoSG subsets)
Model(s)
Whisper, Wav2vec2, AudioMAE, AASIST
Author countries
Taiwan, USA