Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR
Authors: Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni, Tanel Alumäe, Mathew Magimai Doss
Published: 2026-03-06 11:16:55+00:00
Comment: Submitted to Interspeech 2026, 4 pages, 2 figures
AI Summary
This paper investigates the role of compact self-supervised learning (SSL) backbones for audio deepfake detection using RAPTOR, a pairwise-gated fusion detector. The study reveals that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, allowing 100M-parameter models to perform comparably to larger models and commercial systems. Furthermore, a test-time augmentation protocol exposes overconfident miscalibration in WavLM variants, highlighting the importance of SSL pre-training trajectory over model scale for reliable detection.
Abstract
Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact backbones understudied. We present RAPTOR (Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M-parameter models to match larger and commercial systems. Beyond EER, we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings indicate that SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.
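The abstract does not specify how the test-time augmentation protocol computes perturbation-based aleatoric uncertainty, so the following is only a minimal sketch of the general idea: score several perturbed copies of an input and treat the spread of scores as an uncertainty signal. The detector stub `score_waveform`, the additive-noise perturbation, and parameters such as `n_aug` and `noise_std` are hypothetical placeholders, not the paper's actual method.

```python
import numpy as np


def score_waveform(wave: np.ndarray) -> float:
    """Placeholder deepfake score in [0, 1]; stand-in for a real detector such as RAPTOR."""
    # Hypothetical: a real system would run an SSL backbone plus a fusion head here.
    return float(1.0 / (1.0 + np.exp(-10.0 * wave.mean())))


def tta_uncertainty(wave: np.ndarray, n_aug: int = 8, noise_std: float = 0.005,
                    rng: np.random.Generator | None = None) -> tuple[float, float]:
    """Score n_aug noise-perturbed copies of the input and return (mean score, score variance).

    The variance across perturbed copies serves as a perturbation-based,
    aleatoric-style uncertainty estimate: a well-calibrated detector should
    not swing wildly under small input perturbations.
    """
    rng = rng or np.random.default_rng(0)
    scores = []
    for _ in range(n_aug):
        perturbed = wave + rng.normal(0.0, noise_std, size=wave.shape)
        scores.append(score_waveform(perturbed))
    scores = np.asarray(scores)
    return float(scores.mean()), float(scores.var())


if __name__ == "__main__":
    # Toy 1-second waveform at 16 kHz, used only to exercise the sketch.
    wave = np.random.default_rng(1).normal(0.0, 0.1, size=16000)
    mean_score, variance = tta_uncertainty(wave)
    print(f"mean score = {mean_score:.3f}, perturbation variance = {variance:.5f}")
```

Under this kind of protocol, a model that stays near the same score across perturbations (low variance) would be considered stable, while one whose scores shift sharply would be flagged as overconfidently miscalibrated, which is the contrast the abstract draws between iterative mHuBERT and the WavLM variants.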