BreathNet: Generalizable Audio Deepfake Detection via Breath-Cue-Guided Feature Refinement

Authors: Zhe Ye, Xiangui Kang, Jiayi He, Chengxin Chen, Wei Zhu, Kai Wu, Yin Yang, Jiwu Huang

Published: 2026-02-14 04:26:37+00:00

Comment: Under Review

AI Summary

This paper introduces BreathNet, a novel audio deepfake detection framework designed to improve generalization by integrating fine-grained breath information and frequency-domain features. It proposes BreathFiLM, a feature-wise linear modulation mechanism guided by breathing sounds, and a suite of feature losses (PSCL, center loss, contrast loss) to enhance discriminative ability in the feature space. BreathNet achieves state-of-the-art performance on multiple benchmark datasets, demonstrating strong generalization capabilities without requiring breath masks during inference.

Abstract

As deepfake audio becomes more realistic and diverse, developing generalizable countermeasure systems has become crucial. Existing detection methods primarily depend on XLS-R front-end features to improve generalization. Nonetheless, their performance remains limited, partly due to insufficient attention to fine-grained information, such as physiological cues or frequency-domain features. In this paper, we propose BreathNet, a novel audio deepfake detection framework that integrates fine-grained breath information to improve generalization. Specifically, we design BreathFiLM, a feature-wise linear modulation mechanism that selectively amplifies temporal representations based on the presence of breathing sounds. BreathFiLM is trained jointly with the XLS-R extractor, in turn encouraging the extractor to learn and encode breath-related cues into the temporal features. We then use a frequency front-end to extract spectral features, which are fused with the temporal features to provide complementary cues about artifacts introduced by vocoders or compression. Additionally, we propose a group of feature losses comprising Positive-only Supervised Contrastive Loss (PSCL), center loss, and contrast loss. These losses jointly enhance the model's discriminative ability, encouraging it to separate bona fide and deepfake samples more effectively in the feature space. Extensive experiments on five benchmark datasets demonstrate state-of-the-art (SOTA) performance. Using the ASVspoof 2019 LA training set, our method attains 1.99% average EER across four evaluation benchmarks, with particularly strong performance on the In-the-Wild dataset, where it achieves 4.70% EER. Moreover, under the ASVspoof5 evaluation protocol, our method achieves an EER of 4.94% on that latest benchmark.
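The feature-space objectives described above can be illustrated with a minimal sketch. The exact formulations of PSCL and the loss weighting are the paper's; the function names, margin value, and toy shapes below are illustrative assumptions, not the authors' implementation.

```python
import math

def center_loss(feats, labels, centers):
    """Center loss sketch: mean squared distance of each feature
    vector to the center of its class (intra-class compactness)."""
    total = 0.0
    for f, y in zip(feats, labels):
        total += sum((fi - ci) ** 2 for fi, ci in zip(f, centers[y]))
    return total / len(feats)

def center_contrast(c_bona, c_fake, margin=1.0):
    """Contrast-loss sketch: penalize class centers that sit closer
    than a margin (inter-class separation). Margin is a hypothetical
    hyperparameter, not a value from the paper."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(c_bona, c_fake)))
    return max(0.0, margin - d) ** 2
```

With features already sitting at their class centers, the center loss is zero, while the contrast term stays positive until the two centers are at least a margin apart.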


Key findings
BreathNet achieves state-of-the-art performance on five benchmark datasets, notably attaining 1.99% average EER across four evaluation benchmarks (19LA, 21LA, 21DF, ITW). It shows particularly strong generalization on the In-the-Wild (ITW) dataset with 4.70% EER and on the latest ASVspoof5 benchmark with 4.94% EER. Ablation studies confirm the individual contributions of BreathFiLM, the frequency features, and the proposed feature losses to the overall performance gains and generalization ability.
Approach
The proposed BreathNet uses a dual-branch architecture, extracting temporal features via a fine-tuned XLS-R model with a BreathFiLM module and spectral features via a SincConv-based DFIM module. These features are fused using cross-attention, and a group of feature losses (Positive-only Supervised Contrastive Loss, center loss, and contrast loss) is applied to enhance feature separability, with particular focus on intra-class compactness for bona fide samples and inter-class separation between bona fide and deepfake samples.
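The FiLM-style modulation at the core of BreathFiLM can be sketched as follows. In the paper, the affine parameters come from an MLP conditioned on breath cues; here, as a simplifying assumption, they are a linear function of a scalar breath-presence score, and all shapes and names are hypothetical.

```python
def breath_film(features, breath_score, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation sketch: y = gamma * x + beta,
    where gamma and beta are predicted from a breath-presence score
    in [0, 1]. A higher score amplifies the temporal features."""
    gamma = w_gamma * breath_score + b_gamma
    beta = w_beta * breath_score + b_beta
    return [gamma * x + beta for x in features]
```

With a breath score of 0 and an identity initialization (b_gamma=1, b_beta=0), the modulation leaves the features unchanged; a positive score scales and shifts them, which is how the mechanism can selectively amplify breath-bearing frames.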
Datasets
ASVspoof 2019 LA (19LA), ASVspoof 2021 LA (21LA), ASVspoof 2021 DF (21DF), In-the-Wild (ITW), ASVspoof5
Model(s)
XLS-R (0.3B parameters), BreathFiLM (with MLP), DFIM (SincConv-based), BiLSTM, Cross-attention
Author countries
China