Neural Codec Source Tracing: Toward Comprehensive Attribution in Open-Set Condition

Authors: Yuankun Xie, Xiaopeng Wang, Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Songjun Cao, Long Ma, Chenxing Li, Haonan Cheng, Long Ye

Published: 2025-01-11 11:15:58+00:00

AI Summary

This paper introduces the Neural Codec Source Tracing (NCST) task for open-set audio deepfake detection, encompassing both neural codec classification and audio language model (ALM) detection. A new dataset, ST-Codecfake, is created to benchmark NCST models under open-set conditions, revealing limitations in classifying unseen real audio despite strong performance on in-distribution and out-of-distribution tasks.

Abstract

Current research in audio deepfake detection is gradually transitioning from binary classification to multi-class tasks, referred to as the audio deepfake source tracing task. However, existing studies on source tracing consider only closed-set scenarios and have not addressed the challenges posed by open-set conditions. In this paper, we define the Neural Codec Source Tracing (NCST) task, which is capable of performing open-set neural codec classification and interpretable ALM detection. Specifically, we constructed the ST-Codecfake dataset for the NCST task, which includes bilingual audio samples generated by 11 state-of-the-art neural codec methods and ALM-based out-of-distribution (OOD) test samples. Furthermore, we establish a comprehensive source tracing benchmark to assess NCST models in open-set conditions. The experimental results reveal that although the NCST models perform well in in-distribution (ID) classification and OOD detection, they lack robustness in classifying unseen real audio. The ST-Codecfake dataset and code are available.


Key findings
While NCST models achieve high accuracy in in-distribution classification and out-of-distribution detection, they struggle to robustly classify unseen real audio. Logits-based OOD detection methods perform better than feature-based methods. The models also perform well at identifying the neural codec backend of ALM-generated audio.
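To make the logits-based versus feature-based distinction concrete, the following is a minimal sketch of three generic OOD scoring functions: maximum softmax probability and energy (logits-based) and nearest-class-centroid distance (feature-based). This is not the paper's implementation; the function names and score choices are illustrative assumptions.

```python
# Hedged sketch: generic OOD scores. Higher score => more in-distribution-like.
import numpy as np

def msp_score(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax probability over the ID classes (logits-based)."""
    z = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def energy_score(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Negative free energy, T * logsumexp(logits / T) (logits-based)."""
    m = logits.max(axis=1, keepdims=True)
    return (T * np.log(np.exp((logits - m) / T).sum(axis=1, keepdims=True)) + m).squeeze(1)

def centroid_distance_score(feats: np.ndarray,
                            train_feats: np.ndarray,
                            train_labels: np.ndarray) -> np.ndarray:
    """Feature-based score: negative distance to the nearest ID class centroid."""
    centroids = np.stack([train_feats[train_labels == c].mean(axis=0)
                          for c in np.unique(train_labels)])
    dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=-1)
    return -dists.min(axis=1)
```

In use, each score is computed for ID test samples and OOD samples (e.g., ALM-generated audio), and the two distributions are compared via a threshold or AUROC.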
Approach
The authors propose the NCST task and create the ST-Codecfake dataset, containing audio generated by 11 neural codecs along with real audio. They evaluate three baseline models (Mel-LCNN, AASIST, and W2V2-AASIST) on this dataset using in-distribution classification and out-of-distribution detection metrics, analyzing performance across varied conditions such as unseen real audio and codecs with different configurations and sources.
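A minimal sketch of this kind of open-set evaluation loop is given below: closed-set accuracy over the ID classes (real audio plus the 11 codec classes) and AUROC for separating ID from OOD samples using a logits-based score. The `model` interface, loaders, and the 12-class assumption are illustrative, not the paper's exact protocol.

```python
# Hedged sketch of an open-set evaluation: ID accuracy + OOD-detection AUROC.
import numpy as np
import torch
from sklearn.metrics import roc_auc_score

NUM_ID_CLASSES = 12  # assumption: real audio + 11 neural codec classes

@torch.no_grad()
def evaluate_open_set(model, id_loader, ood_loader, device="cpu"):
    model.eval().to(device)

    def collect(loader):
        logits, labels = [], []
        for wav, y in loader:
            logits.append(model(wav.to(device)).cpu())
            labels.append(y)
        return torch.cat(logits), torch.cat(labels)

    id_logits, id_labels = collect(id_loader)
    ood_logits, _ = collect(ood_loader)

    # Closed-set (in-distribution) accuracy over the ID classes.
    id_acc = (id_logits.argmax(dim=1) == id_labels).float().mean().item()

    # Logits-based OOD score (maximum logit); higher => more ID-like.
    scores = torch.cat([id_logits.max(dim=1).values,
                        ood_logits.max(dim=1).values]).numpy()
    is_id = np.concatenate([np.ones(len(id_logits)), np.zeros(len(ood_logits))])
    ood_auroc = roc_auc_score(is_id, scores)

    return {"id_accuracy": id_acc, "ood_auroc": ood_auroc}
```

The same loop can be rerun with unseen real-audio sources substituted into the ID loader to probe the robustness gap reported in the key findings.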
Datasets
ST-Codecfake (includes bilingual audio from 11 neural codec methods, real audio from VCTK and AISHELL3, and ALM-based OOD samples); VCTK; AISHELL3; ASVspoof2019LA; In the Wild (ITW); NC-SSD; Codecfake
Model(s)
Mel-LCNN, AASIST, W2V2-AASIST
Author countries
China