Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR

Authors: Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni, Tanel Alumäe, Mathew Magimai Doss

Published: 2026-03-06 11:16:55+00:00

Comment: Submitted to Interspeech 2026, 4 pages, 2 figures

AI Summary

This paper investigates the role of compact self-supervised learning (SSL) backbones for audio deepfake detection using RAPTOR, a pairwise-gated fusion detector. The study reveals that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, allowing 100M models to perform comparably to larger and commercial systems. Furthermore, a test-time augmentation protocol exposes overconfident miscalibration in WavLM variants, highlighting the importance of SSL pre-training trajectory over model scale for reliable detection.

Abstract

Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact backbones understudied. We present RAPTOR (Representation-Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M models to match larger and commercial systems. Beyond EER, we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings indicate that SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.


Key findings
Iterative multilingual SSL pre-training, particularly with mHuBERT, is a first-order factor for cross-domain audio deepfake detection robustness, enabling compact 100M models to match or outperform larger (300M+) and commercial systems. Model scale alone does not guarantee superior performance. Test-time augmentation (TTA) with aleatoric uncertainty reveals systematic overconfident miscalibration in WavLM variants under perturbation, a critical deployment risk not visible with standard EER metrics, contrasting with the stable calibration of mHuBERT variants.
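The TTA protocol described above can be illustrated with a minimal NumPy sketch: score several perturbed copies of an input, then use the mean probability as the prediction and the spread across perturbations as a perturbation-based aleatoric-uncertainty proxy. The function names, Gaussian perturbation, and the toy scorer are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

rng = np.random.default_rng(1)

def tta_uncertainty(score_fn, x, n_aug=8, noise_std=0.01):
    """Score n_aug perturbed copies of x; return mean probability
    and the variance across perturbations (aleatoric proxy)."""
    probs = np.array([
        score_fn(x + rng.normal(0.0, noise_std, x.shape))
        for _ in range(n_aug)
    ])
    return probs.mean(), probs.var()

# Toy stand-in for a deepfake detector: logistic score of mean energy.
def toy_score(x):
    return 1.0 / (1.0 + np.exp(-x.mean()))

x = rng.normal(0.0, 1.0, size=16000)  # 1 s of 16 kHz "audio"
mean_p, var_p = tta_uncertainty(toy_score, x)
```

A well-calibrated model keeps `mean_p` close to its clean-input score with small `var_p` under mild perturbation; the overconfidence the paper reports in WavLM variants would show up as confident but unstable scores under this kind of probing.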
Approach
The authors employ RAPTOR, a pairwise-gated hierarchical layer-fusion architecture, as a unified downstream detector. This architecture fuses hidden representations from SSL encoders using learned gating stages, followed by attention pooling and a binary classifier. Consistency regularization is applied to the gating distributions, and a test-time augmentation protocol with perturbation-based aleatoric uncertainty is introduced to assess model calibration.
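The fusion idea can be sketched as follows: adjacent pairs of SSL layer outputs are merged with a learned sigmoid gate, the surviving streams are paired again until one remains, and the result is attention-pooled over time. This is a minimal NumPy sketch under stated assumptions; the adjacent-pair scheme, per-frame scalar gate, and all names are illustrative, not RAPTOR's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_gate(a, b, w):
    """Fuse two (T, D) layer outputs with a per-frame scalar gate."""
    g = sigmoid(np.concatenate([a, b], axis=-1) @ w)  # (T, 1)
    return g * a + (1.0 - g) * b                      # convex mix per frame

def hierarchical_fusion(layers, gate_weights):
    """Repeatedly fuse adjacent pairs until one stream remains."""
    level = 0
    while len(layers) > 1:
        layers = [
            pairwise_gate(layers[i], layers[i + 1], gate_weights[level][i // 2])
            for i in range(0, len(layers), 2)
        ]
        level += 1
    return layers[0]

def attention_pool(h, v):
    """Softmax-attention pooling over time: (T, D) -> (D,)."""
    scores = h @ v
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ h

# Demo: 4 SSL layers, 50 frames, 8-dim features (toy sizes).
T, D = 50, 8
layers = [rng.normal(0.0, 1.0, (T, D)) for _ in range(4)]
gate_weights = [
    [rng.normal(0.0, 0.1, (2 * D, 1)) for _ in range(2)],  # level 0: 2 gates
    [rng.normal(0.0, 0.1, (2 * D, 1))],                    # level 1: 1 gate
]
fused = hierarchical_fusion(layers, gate_weights)  # (T, D)
pooled = attention_pool(fused, rng.normal(0.0, 1.0, D))  # (D,) -> classifier
```

The pooled vector would then feed a binary real/fake classifier; the consistency regularization the authors apply to the gating distributions is omitted here for brevity.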
Datasets
ASVspoof 2019, ASVspoof 2024, CodecFake, LibriSeVoc, DFADD, CTRSVDD, SpoofCeleb, MLAAD, EnvSDD (for training under Protocol 2/Speech DF Arena); ASVspoof (2019, 2021LA/DF, 2024), ADD (2022, 2023), CodecFake, LibriSeVoc, SONAR, FoR, DFADD, ITW (for evaluation). Pre-training data includes LibriSpeech, and large-scale multi-language datasets for mHuBERT and WavLM+.
Model(s)
HuBERT-Base, mHuBERT-Iter1, mHuBERT-Iter2, mHuBERT-Final, WavLM-Base, WavLM-Base+ (as SSL backbones); RAPTOR (as the unified layer-fusion detector).
Author countries
Switzerland, Estonia, UAE