Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy

Authors: Botao Zhao, Zuheng Kang, Yayun He, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang

Published: 2025-04-15 02:39:46+00:00

AI Summary

This paper proposes f-InfoED, a frame-level latent information entropy detector, for generalized audio deepfake detection. It leverages the variational information bottleneck to extract discriminative information entropy from latent representations, achieving state-of-the-art performance and remarkable generalization capabilities.

Abstract

Generalizability, the capacity of a robust model to perform effectively on unseen data, is crucial for audio deepfake detection due to the rapid evolution of text-to-speech (TTS) and voice conversion (VC) technologies. A promising approach to differentiating between bonafide and spoof samples lies in identifying intrinsic disparities to enhance model generalizability. From an information-theoretic perspective, we hypothesize that information content is one such intrinsic difference: a bonafide sample represents a dense, information-rich sampling of the real world, whereas a spoof sample is typically derived from lower-dimensional, less informative representations. To implement this, we introduce the frame-level latent information entropy detector (f-InfoED), a framework that extracts distinctive information entropy from latent representations at the frame level to identify audio deepfakes. Furthermore, we present AdaLAM, which extends large pre-trained audio models with trainable adapters for enhanced feature extraction. To facilitate comprehensive evaluation, the audio deepfake forensics 2024 (ADFF 2024) dataset was built using the latest TTS and VC methods. Extensive experiments demonstrate that our proposed approach achieves state-of-the-art performance and exhibits remarkable generalization capabilities. Further analytical studies confirm the efficacy of AdaLAM in extracting discriminative audio features and of f-InfoED in leveraging latent entropy information for more generalized deepfake detection.
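The frame-level entropy idea can be illustrated with a small sketch. If, as in standard variational information bottleneck setups, the encoder emits a diagonal-Gaussian posterior per frame, the differential entropy has a closed form, H = ½ Σ_d log(2πe·σ_d²). The function below is an illustrative assumption about this computation, not the paper's actual implementation.

```python
import math
import numpy as np

def frame_entropy(log_var: np.ndarray) -> np.ndarray:
    """Differential entropy of a diagonal-Gaussian latent, per frame.

    log_var: (batch, frames, dim) log-variances from a VIB-style encoder.
    Returns: (batch, frames) with H = 0.5 * sum_d log(2*pi*e*sigma_d^2).
    """
    const = 0.5 * math.log(2 * math.pi * math.e)
    return (const + 0.5 * log_var).sum(axis=-1)

# Toy check: frames with larger latent variance carry higher entropy,
# matching the intuition that information-rich (bonafide) frames differ
# measurably from low-information (spoof) frames.
h_low = frame_entropy(np.full((1, 4, 8), -2.0))   # small variances
h_high = frame_entropy(np.full((1, 4, 8), 0.0))   # unit variances
assert (h_high > h_low).all()
```

A per-frame (rather than utterance-level) entropy keeps the temporal resolution needed to flag locally synthetic segments.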


Key findings
The proposed method achieves state-of-the-art performance across datasets, including cross-dataset and in-the-wild evaluations. It generalizes strongly to unseen perturbations (varied audio durations and MP3 compression rates) and shows potential for other modalities such as image deepfake detection.
Approach
The approach uses a two-component framework: AdaLAM, which adapts large pre-trained audio models for enhanced feature extraction, and f-InfoED, which computes frame-level latent information entropy from the extracted features to differentiate real from fake audio. The model is trained with a multi-task objective that jointly optimizes a reconstruction loss, a KL loss, and a classification loss.
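The multi-task objective described above might be combined as in the sketch below. The loss weights, shapes, and stable binary-cross-entropy form are illustrative assumptions; the paper's exact losses and weighting are not specified here.

```python
import numpy as np

def total_loss(recon, target, mu, log_var, logits, labels,
               w_rec=1.0, w_kl=0.1, w_cls=1.0):
    """Sketch of a VIB-style multi-task objective (weights are illustrative).

    recon/target:  reconstructed vs. input features, shape (B, T, F)
    mu/log_var:    Gaussian posterior parameters, shape (B, T, D)
    logits/labels: real-vs-fake classifier outputs, shape (B,), labels in {0, 1}
    """
    # Reconstruction term: mean squared error on the decoded features.
    l_rec = np.mean((recon - target) ** 2)
    # KL divergence to a standard-normal prior, closed form per dimension.
    l_kl = 0.5 * np.mean(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    # Binary cross-entropy from logits, in the numerically stable form.
    l_cls = np.mean(np.maximum(logits, 0) - logits * labels
                    + np.log1p(np.exp(-np.abs(logits))))
    return w_rec * l_rec + w_kl * l_kl + w_cls * l_cls
```

The KL term regularizes the latent posterior toward the prior (the usual information-bottleneck pressure), while the reconstruction and classification terms keep the latents informative about the input and the real/fake label.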
Datasets
ASVspoof 2019 LA, ASVspoof 2021 DF, In-the-wild dataset, and the newly created ADFF 2024 dataset.
Model(s)
AdaLAM (adapted large pre-trained audio models, such as WavLM base+), f-InfoED (frame-level latent information entropy detector), HiFi-GAN (used in the decoder).
Author countries
China