SONAR: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection

Authors: Ido Nitzan Hidekel, Gal Lifshitz, Khen Cohen, Dan Raviv

Published: 2025-11-26 12:16:38+00:00

AI Summary

Deepfake (DF) audio detectors still struggle to generalize to out-of-distribution inputs due to spectral bias, which both causes generators to leave high-frequency (HF) artifacts and leads detectors to under-exploit them. To address this, SONAR proposes a frequency-guided framework that explicitly disentangles an audio signal into low-frequency content and HF residuals using XLSR encoders and learnable SRM filters. By employing frequency cross-attention and a frequency-aware Jensen-Shannon contrastive loss, SONAR aligns real content-noise pairs while pushing fake embeddings apart, achieving state-of-the-art generalization and significantly faster convergence.
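As a rough illustration of the fusion step, the following minimal PyTorch sketch shows one way frequency cross-attention could reunite the low-frequency content tokens and high-frequency residual tokens. The module name, dimensions, and symmetric two-way design are assumptions; the summary only states that cross-attention fuses the two views.

```python
import torch
import torch.nn as nn

class FrequencyCrossAttention(nn.Module):
    """Hypothetical sketch: LF (content) tokens attend to HF (residual)
    tokens and vice versa, then the two contexts are merged. Sizes and
    the symmetric design are illustrative assumptions."""

    def __init__(self, dim: int = 1024, n_heads: int = 8):
        super().__init__()
        self.lf_to_hf = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.hf_to_lf = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, lf: torch.Tensor, hf: torch.Tensor) -> torch.Tensor:
        # lf, hf: (batch, frames, dim) token sequences from the two XLSR
        # paths; both are assumed to share the same frame count.
        lf_ctx, _ = self.lf_to_hf(query=lf, key=hf, value=hf)  # LF queries HF cues
        hf_ctx, _ = self.hf_to_lf(query=hf, key=lf, value=lf)  # HF queries LF context
        return self.proj(torch.cat([lf_ctx, hf_ctx], dim=-1))  # fused representation

# Usage with random stand-ins for the two token streams.
fuse = FrequencyCrossAttention()
lf = torch.randn(2, 50, 1024)   # content tokens from the clean XLSR path
hf = torch.randn(2, 50, 1024)   # residual tokens from the filtered path
fused = fuse(lf, hf)            # (2, 50, 1024)
```

Letting each view query the other is one plausible way to capture the long- and short-range frequency dependencies the summary refers to.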

Abstract

Deepfake (DF) audio detectors still struggle to generalize to out-of-distribution inputs. A central reason is spectral bias, the tendency of neural networks to learn low-frequency structure before high-frequency (HF) details, which both causes DF generators to leave HF artifacts and leaves those same artifacts under-exploited by common detectors. To address this gap, we propose Spectral-cONtrastive Audio Residuals (SONAR), a frequency-guided framework that explicitly disentangles an audio signal into complementary representations. An XLSR encoder captures the dominant low-frequency content, while a cloned copy of the same path, preceded by learnable, value-constrained SRM high-pass filters, distills faint HF residuals. Frequency cross-attention reunites the two views to capture long- and short-range frequency dependencies, and a frequency-aware Jensen-Shannon contrastive loss pulls real content-noise pairs together while pushing fake embeddings apart, accelerating optimization and sharpening decision boundaries. Evaluated on the ASVspoof 2021 and In-the-Wild benchmarks, SONAR attains state-of-the-art performance and converges four times faster than strong baselines. By elevating faint high-frequency residuals to first-class learning signals, SONAR provides a fully data-driven, frequency-guided contrastive framework that splits the latent space into two disjoint manifolds: natural-HF for genuine audio and distorted-HF for synthetic audio. Because the scheme operates purely at the representation level, it is architecture-agnostic and can, in future work, be seamlessly integrated into any model or modality where subtle high-frequency cues are decisive.
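For intuition on the HF path, here is a minimal sketch of a learnable, value-constrained high-pass filter bank in the spirit of SRM filters. The filter count, kernel size, and the exact constraint (tap clamping plus zero-sum kernels) are assumptions for illustration, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstrainedHighPass1d(nn.Module):
    """Sketch of a learnable, value-constrained high-pass filter bank
    (SRM-style) applied to raw audio."""

    def __init__(self, n_filters: int = 8, kernel_size: int = 5, clamp: float = 2.0):
        super().__init__()
        assert kernel_size % 2 == 1, "odd length keeps the filter centered"
        self.kernel_size = kernel_size
        self.clamp = clamp
        self.weight = nn.Parameter(0.1 * torch.randn(n_filters, 1, kernel_size))

    def constrained_weight(self) -> torch.Tensor:
        # Value constraint (assumed form): clamp tap magnitudes, then
        # force each kernel to sum to zero so it has no DC gain and
        # passes only high-frequency residuals.
        w = self.weight.clamp(-self.clamp, self.clamp)
        return w - w.mean(dim=-1, keepdim=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, samples) waveform -> (batch, n_filters, samples)
        return F.conv1d(x, self.constrained_weight(), padding=self.kernel_size // 2)

# The resulting HF residuals would feed the cloned XLSR path.
residuals = ConstrainedHighPass1d()(torch.randn(2, 1, 16000))  # (2, 8, 16000)
```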


Key findings
SONAR achieves state-of-the-art performance, setting new best EERs on the ASVspoof 2021 (DF, LA) and In-the-Wild benchmarks. It converges significantly faster (4-8x) than strong baselines while remaining robust to codec and bandwidth shifts. The framework effectively disentangles the latent space into distinct manifolds for genuine and synthetic audio, improving generalization by elevating high-frequency residuals into discriminative signals.
Approach
SONAR uses a dual-path architecture: an XLSR encoder extracts low-frequency content, while a parallel path with learnable SRM high-pass filters distills high-frequency residuals. These complementary representations are fused via frequency cross-attention to capture long- and short-range dependencies. A frequency-aware Jensen-Shannon contrastive loss then pulls genuine content-noise embeddings together and pushes fake ones apart, sharpening decision boundaries by exploiting differences in LF-HF statistical coupling.
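Below is a minimal sketch of how such a frequency-aware Jensen-Shannon contrastive objective could look, assuming softmax-normalized embeddings and a margin-based push term; both are illustrative choices, since the source specifies only the pull/push behavior.

```python
import torch
import torch.nn.functional as F

def js_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Per-sample symmetric Jensen-Shannon divergence between two
    # probability distributions of shape (batch, dim).
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps).log() - (m + eps).log())).sum(dim=-1)
    kl_qm = (q * ((q + eps).log() - (m + eps).log())).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm)

def js_contrastive_loss(z_lf, z_hf, is_real, margin: float = 0.5):
    # Turn the LF (content) and HF (residual) embeddings into
    # distributions, then pull genuine pairs together and push fake
    # pairs at least `margin` apart. With natural logs, JSD is bounded
    # by ln 2 ~ 0.693, so the margin must sit below that bound.
    p = F.softmax(z_lf, dim=-1)
    q = F.softmax(z_hf, dim=-1)
    d = js_divergence(p, q)
    pull = d                       # genuine: LF and HF views should agree
    push = F.relu(margin - d)      # fake: LF-HF coupling is broken
    return torch.where(is_real.bool(), pull, push).mean()

# Usage with random stand-ins for the per-utterance embeddings.
z_lf, z_hf = torch.randn(4, 256), torch.randn(4, 256)
is_real = torch.tensor([1, 1, 0, 0])
loss = js_contrastive_loss(z_lf, z_hf, is_real)
```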
Datasets
ASVspoof 2019 Logical Access (LA) training set, ASVspoof 2021 competition datasets (LA and DF scenarios), In-the-Wild dataset
Model(s)
Wav2Vec2.0 XLSR Encoder, Learnable SRM filters, AASIST classifier, XLSR-Mamba (for SONAR-Finetune), Lightweight two-layer MLP (for SONAR-Lite)
Author countries
Israel