How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection

Authors: Yixuan Xiao, Florian Lux, Alejandro Pérez-González-de-Martos, Ngoc Thang Vu

Published: 2026-02-18 10:29:07+00:00

Comment: Accepted to ICASSP 2026

AI Summary

This study addresses the critical challenge of labeling resynthesized audio from neural audio codecs (NACs) in audio deepfake detection, given their dual role in compression and speech synthesis. The paper introduces a new, challenging dataset, CodecDeepfakeDetection (CDD), which extends ASVspoof 5. It thoroughly investigates how different labeling choices for codec-resynthesized audio (CoRS) affect deepfake detection performance and provides insights into optimal labeling strategies.

Abstract

Since Text-to-Speech systems typically don't produce waveforms directly, recent spoof detection studies use resynthesized waveforms from vocoders and neural audio codecs to simulate an attacker. Unlike vocoders, which are specifically designed for speech synthesis, neural audio codecs were originally developed for compressing audio for storage and transmission. However, their ability to discretize speech also sparked interest in language-modeling-based speech synthesis. Owing to this dual functionality, codec resynthesized data may be labeled as either bonafide or spoof. So far, very little research has addressed this issue. In this study, we present a challenging extension of the ASVspoof 5 dataset constructed for this purpose. We examine how different labeling choices affect detection performance and provide insights into labeling strategies.

Key findings

The newly introduced CodecDeepfakeDetection dataset poses a significant challenge, with existing detectors exhibiting high Equal Error Rates. The optimal labeling strategy for codec-resynthesized audio (CoRS) depends on the NAC's design objective: treating compression-oriented CoRS as bonafide helps detectors learn codec-invariant cues, while treating synthesis-oriented CoRS as spoof is more effective for exploiting codec-specific artifacts. NACs proficient in both compression and synthesis present a complex and ongoing challenge for disentangling codec artifacts from spoofness.

Approach

The authors construct a novel, challenging dataset called CodecDeepfakeDetection, an extension of ASVspoof 5, specifically designed to explore the dual role of neural audio codecs (NACs). They then conduct experiments using state-of-the-art audio deepfake detectors (X-AASIST and LWBN) to assess the impact of various labeling strategies for codec-resynthesized audio (treating CoRS as either bonafide or spoof) on detection performance across different types of codec-based attacks.

Datasets

CodecDeepfakeDetection (CDD), ASVspoof 5

Model(s)

XLS-R (frontend), AASIST (backend, forming X-AASIST), LWBN (backend)

Author countries

Germany

← Previous