Towards Neural Audio Codec Source Parsing

Authors: Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Arun Balaji Buduru, Rajesh Sharma

Published: 2025-06-14 21:00:39+00:00

AI Summary

This paper introduces Neural Audio Codec Source Parsing (NACSP), a novel paradigm reframing audio deepfake source attribution as a multi-task regression problem to predict generative Neural Audio Codec (NAC) parameters. It proposes HYDRA, a framework leveraging hyperbolic geometry and task-specific attention to disentangle latent properties from pre-trained model representations. HYDRA significantly outperforms Euclidean baselines on benchmark codecfake datasets, enabling more granular and generalizable forensic insights into unseen NACs.

Abstract

A new class of audio deepfakes, codecfakes (CFs), has recently caught attention, synthesized by Audio Language Models that leverage neural audio codecs (NACs) in the backend. In response, the community has introduced dedicated benchmarks and tailored detection strategies. As the field advances, efforts have moved beyond binary detection toward source attribution, including open-set attribution, which aims to identify the NAC responsible for generation and flag novel, unseen ones during inference. This shift toward source attribution improves forensic interpretability and accountability. However, open-set attribution remains fundamentally limited: while it can detect that a NAC is unfamiliar, it cannot characterize or identify individual unseen codecs. It treats such inputs as generic "unknowns", lacking insight into their internal configuration. This leads to major shortcomings: limited generalization to new NACs and inability to resolve fine-grained variations within NAC families. To address these gaps, we propose Neural Audio Codec Source Parsing (NACSP), a paradigm shift that reframes source attribution for CFs as structured regression over generative NAC parameters such as quantizers, bandwidth, and sampling rate. We formulate NACSP as a multi-task regression problem for predicting these NAC parameters and establish the first comprehensive benchmark using various state-of-the-art speech pre-trained models (PTMs). To this end, we propose HYDRA, a novel framework that leverages hyperbolic geometry to disentangle complex latent properties from PTM representations. By employing task-specific attention over multiple curvature-aware hyperbolic subspaces, HYDRA enables superior multi-task generalization. Our extensive experiments show HYDRA achieves top results on benchmark CF datasets compared to baselines operating in Euclidean space.


Key findings
HYDRA significantly outperforms Euclidean baselines in predicting NAC parameters (quantizers, sampling rate, and bits per second) across both closed-set and open-set scenarios on benchmark datasets. While HYDRA yields substantial performance improvements, the specific choice of pre-trained model (PTM) has minimal impact on the overall NACSP outcomes. HYDRA, particularly when using x-vector representations, establishes a new state of the art for neural audio codec source parsing.
Approach
The authors propose Neural Audio Codec Source Parsing (NACSP), which reframes audio deepfake source attribution as multi-task regression over generative NAC parameters (quantizers, sampling rate, bits per second). They introduce HYDRA, a framework that takes pre-trained model (PTM) representations, processes them through a shared convolutional block, projects them into task-specific hyperbolic subspaces, aggregates features using attention, and then maps them back to Euclidean space for final regression via task-specific fully connected networks.
Datasets
ST-codecfake, CodecFake
Model(s)
HYDRA and various pre-trained models (PTMs), including Unispeech-SAT, WavLM, Wav2vec2, XLS-R, Whisper, MMS, x-vector, and ECAPA.
Author countries
India, Estonia