Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy

Authors: Botao Zhao, Zuheng Kang, Yayun He, Xiaoyang Qu, Junqing Peng, Jing Xiao, Jianzong Wang

Published: 2025-04-15 02:39:46+00:00

Comment: Accepted by IEEE International Conference on Multimedia & Expo 2025 (ICME 2025)

AI Summary

This paper introduces a novel framework for generalized audio deepfake detection, f-InfoED, which leverages frame-level latent information entropy to distinguish between bonafide and spoof audio by hypothesizing differences in information content. Coupled with AdaLAM, an adapter-based approach for enhancing feature extraction from large pre-trained audio models, the method achieves state-of-the-art performance and remarkable generalization capabilities across unseen deepfake data and perturbations. The authors also release the ADFF 2024 dataset to facilitate comprehensive evaluation of modern TTS/VC methods.

Abstract

Generalizability, the capacity of a robust model to perform effectively on unseen data, is crucial for audio deepfake detection due to the rapid evolution of text-to-speech (TTS) and voice conversion (VC) technologies. A promising approach to differentiating between bonafide and spoof samples lies in identifying intrinsic disparities to enhance model generalizability. From an information-theoretic perspective, we hypothesize that information content is one of these intrinsic differences: a bonafide sample represents a dense, information-rich sampling of the real world, whereas a spoof sample is typically derived from lower-dimensional, less informative representations. To implement this, we introduce the frame-level latent information entropy detector (f-InfoED), a framework that extracts distinctive information entropy from latent representations at the frame level to identify audio deepfakes. Furthermore, we present AdaLAM, which extends large pre-trained audio models with trainable adapters for enhanced feature extraction. To facilitate comprehensive evaluation, the audio deepfake forensics 2024 (ADFF 2024) dataset was built with the latest TTS and VC methods. Extensive experiments demonstrate that our proposed approach achieves state-of-the-art performance and exhibits remarkable generalization capabilities. Further analytical studies confirm the efficacy of AdaLAM in extracting discriminative audio features and of f-InfoED in leveraging latent entropy information for more generalized deepfake detection.


Key findings
The proposed AdaLAM & f-InfoED method achieves state-of-the-art EERs on ASVspoof 2021 DF, In-the-wild, and the new ADFF 2024 datasets, demonstrating superior generalization to unseen deepfake generation methods. It also exhibits robustness to unseen perturbations like varying audio durations and MP3 compression rates. Furthermore, the underlying InfoED concept shows potential for generalization to other modalities, as evidenced by its application to image deepfake detection with ResNet-50.
Approach
The proposed f-InfoED extracts distinctive information entropy from frame-level latent representations, derived by compressing audio through a variational information bottleneck, to identify deepfakes. This is combined with AdaLAM, which extends large pre-trained audio models (like WavLM base+) with trainable adapter layers for more discriminative and generalized feature extraction. The overall framework aims to differentiate bonafide (information-rich) from spoof (less informative) audio.
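The core quantity described above, the information entropy of a frame-level latent, can be illustrated with a short sketch. Assuming the bottleneck produces a diagonal-Gaussian latent per frame (a standard choice for variational information bottlenecks; the paper's exact parameterization is not given in this summary), the differential entropy per frame is 0.5 * sum_d [log(2*pi*e) + log_var_d]. The shapes and the `frame_entropy` helper below are illustrative, not the authors' code:

```python
import numpy as np

def frame_entropy(log_var: np.ndarray) -> np.ndarray:
    """Differential entropy of a diagonal-Gaussian latent, per frame.

    log_var: (T, D) per-dimension log-variances from a variance encoder
    (hypothetical shape; the paper's encoder output format may differ).
    Returns a (T,) array: 0.5 * sum_d [log(2*pi*e) + log_var_d].
    """
    D = log_var.shape[-1]
    return 0.5 * (D * np.log(2 * np.pi * np.e) + log_var.sum(axis=-1))

# Toy illustration of the hypothesis: an information-rich latent
# (larger variances) carries higher entropy than a compressed one.
rich = frame_entropy(np.full((4, 8), 0.0))   # unit variance per dim
poor = frame_entropy(np.full((4, 8), -2.0))  # variance exp(-2) per dim
```

In this framing, a detector would compare per-frame entropy profiles of bonafide versus spoof audio rather than classifying raw features directly.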
Datasets
ASVspoof 2019 LA (training), ASVspoof 2021 DF, In-the-wild, ADFF 2024 (newly introduced). For the image deepfake detection experiment, datasets from [31] for ADM, DDPM, iDDPM, PNDM, SD-V1 were used.
Model(s)
AdaLAM (WavLM base+ with custom adapter layers), f-InfoED (uses mean/variance encoders via 1D CNNs, a decoder similar to HiFi-GAN, and a classifier with two FC layers). For the image deepfake detection experiment, ResNet-50 was used as the backbone.
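The component list above (mean/variance encoders via 1D CNNs feeding a two-FC-layer classifier) can be sketched minimally in NumPy. All sizes here (16-dim upstream features, 8-dim latent, kernel 3, 32 hidden units) are illustrative assumptions; the paper's actual dimensions, the HiFi-GAN-style decoder, and AdaLAM's adapters are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, b):
    """Valid 1D convolution. x: (C_in, T); w: (C_out, C_in, K); b: (C_out,)."""
    C_out, _, K = w.shape
    T_out = x.shape[1] - K + 1
    y = np.empty((C_out, T_out))
    for t in range(T_out):
        y[:, t] = np.tensordot(w, x[:, t:t + K], axes=([1, 2], [0, 1])) + b
    return y

# Hypothetical sizes: 16-dim upstream features, 8-dim latent, kernel 3.
C_in, D, K, T = 16, 8, 3, 20
w_mu, b_mu = rng.normal(size=(D, C_in, K)) * 0.1, np.zeros(D)
w_lv, b_lv = rng.normal(size=(D, C_in, K)) * 0.1, np.zeros(D)

feats = rng.normal(size=(C_in, T))       # stand-in for AdaLAM frame features
mu = conv1d(feats, w_mu, b_mu)           # (D, T-K+1) frame-level means
log_var = conv1d(feats, w_lv, b_lv)      # matching frame-level log-variances

# Reparameterized latent, then a two-FC-layer classifier on pooled frames.
z = mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)
h = np.maximum(0, rng.normal(size=(32, D)) @ z.mean(axis=1))  # FC1 + ReLU
logit = rng.normal(size=(1, 32)) @ h                          # FC2 -> score
```

The two parallel convolutions play the role of the mean and variance encoders; in the full model these weights would be trained jointly with the decoder's reconstruction objective rather than drawn at random.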
Author countries
China