HierCon: Hierarchical Contrastive Attention for Audio Deepfake Detection

Authors: Zhili Nicholas Liang, Soyeon Caren Han, Qizhou Wang, Christopher Leckie

Published: 2026-02-01 05:36:32+00:00

Comment: Proceedings of The Web Conference 2026 (WWW'26), short track

AI Summary

This paper introduces HierCon, a hierarchical contrastive attention framework for audio deepfake detection, addressing limitations of existing methods that overlook temporal and hierarchical dependencies in multi-layer representations. HierCon models dependencies across temporal frames, neighboring layers, and layer groups, combined with margin-based contrastive learning to encourage domain-invariant embeddings. The method achieves state-of-the-art performance on ASVspoof 2021 DF and In-the-Wild datasets, demonstrating improved generalization to cross-domain generation techniques.

Abstract

Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state-of-the-art self-supervised models provide rich multi-layer representations, existing detectors treat layers independently and overlook temporal and hierarchical dependencies critical for identifying synthetic artefacts. We propose HierCon, a hierarchical layer attention framework combined with margin-based contrastive learning that models dependencies across temporal frames, neighbouring layers, and layer groups, while encouraging domain-invariant embeddings. Evaluated on ASVspoof 2021 DF and In-the-Wild datasets, our method achieves state-of-the-art performance (1.93% and 6.87% EER), improving over independent layer weighting by 36.6% and 22.5% respectively. The results and attention visualisations confirm that hierarchical modelling enhances generalisation to cross-domain generation techniques and recording conditions.


Key findings
HierCon achieved state-of-the-art performance, with EERs of 1.93% on ASVspoof 2021 DF and 6.87% on In-the-Wild, corresponding to relative improvements of 36.6% and 22.5%, respectively, over independent layer weighting. Ablation studies confirmed that hierarchical attention and contrastive learning contribute complementary benefits, enhancing generalization and providing interpretable insights into which temporal regions and layer groups drive predictions.
Approach
The HierCon framework extracts multi-layer representations from a self-supervised backbone (XLS-R) and applies a three-stage hierarchical attention mechanism: temporal attention within each layer, intra-group attention across neighboring layers, and inter-group attention across broader layer clusters. Margin-based contrastive learning enforces a domain-invariant embedding geometry; the classification and contrastive objectives are optimized jointly to improve robustness and discourage reliance on dataset-specific artifacts.
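The three attention stages and the margin loss can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the scoring vectors, group size, dimensions, and the `attention_pool` and `margin_contrastive` helpers are all illustrative assumptions; the paper's actual attention parameterization and loss details may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(feats, w):
    # feats: (N, D); w: (D,) scoring vector (random here, learned in practice).
    scores = feats @ w                   # (N,) unnormalized attention scores
    alpha = softmax(scores)              # attention weights over the N items
    return alpha @ feats, alpha          # weighted sum -> (D,), plus weights

# Toy dimensions: 24 transformer layers, 50 frames, 16-dim features,
# with layers split into 4 groups of 6 neighboring layers.
L, T, D, G = 24, 50, 16, 4
layers = rng.standard_normal((L, T, D))
w_t, w_l, w_g = (rng.standard_normal(D) for _ in range(3))

# Stage 1: temporal attention within each layer -> one vector per layer.
layer_vecs = np.stack([attention_pool(layers[l], w_t)[0] for l in range(L)])

# Stage 2: intra-group attention over neighboring layers -> group vectors.
group_vecs = np.stack([
    attention_pool(layer_vecs[g * (L // G):(g + 1) * (L // G)], w_l)[0]
    for g in range(G)
])

# Stage 3: inter-group attention -> utterance-level embedding.
embedding, group_alpha = attention_pool(group_vecs, w_g)

def margin_contrastive(e1, e2, same_class, margin=1.0):
    # Pairwise margin loss: pull same-class embeddings together,
    # push different-class pairs at least `margin` apart.
    d = np.linalg.norm(e1 - e2)
    return d ** 2 if same_class else max(0.0, margin - d) ** 2

print(embedding.shape)  # (16,)
print(group_alpha)      # per-group attention weights; sum to ~1
```

The inter-group weights `group_alpha` are what an attention visualization like the paper's would inspect, showing which layer groups dominate the utterance-level embedding.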
Datasets
ASVspoof 2019 LA subset (for training), ASVspoof 2021 LA, ASVspoof 2021 DF, In-the-Wild (ITW)
Model(s)
XLS-R 300M (as feature backbone), Transformer network (24 layers)
Author countries
Australia