Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection

Authors: Jaskirat Sudan, Hashim Ali, Surya Subramani, Hafiz Malik

Published: 2026-04-28 18:52:38+00:00

AI Summary

This paper conducts a controlled study of supervised contrastive learning (SupCon) for deepfake audio detection, investigating the impact of the similarity function (cosine vs. angular) and of negative scaling via a warm-started global cross-batch queue. Using a two-stage pipeline with a wav2vec2 XLS-R backbone, the authors find that similarity choice and temperature are coupled, and that the optimal amount of negative scaling depends on the chosen similarity.

Abstract

Supervised contrastive learning (SupCon) is widely used to shape representations, but it has seen little targeted study for audio deepfake detection. Existing work typically folds contrastive terms into broader pipelines; a focused study of SupCon itself is missing. In this work, we run a controlled study on wav2vec2 XLS-R (300M) that varies (i) the similarity function in SupCon (cosine vs. angular similarity derived from the hyperspherical angle) and (ii) negative scaling via a warm-started global cross-batch queue. Stage 1 fine-tunes the encoder and projection head with SupCon; Stage 2 freezes them and trains a linear classifier with BCE. Trained on ASVspoof 2019 LA and evaluated on the ASV19 eval set plus ITW and ASVspoof 2021 DF/LA, cosine SupCon with a delayed queue achieves the best ITW EER (8.29%) and pooled EER (4.44%), while angular similarity performs strongly without queued negatives (ITW 8.70%), indicating reduced reliance on large negative sets.
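The cosine-vs-angular contrast in the abstract can be sketched as follows. This is an illustrative NumPy implementation of the standard SupCon objective; the angular variant is assumed here to be the negated geodesic angle, -arccos(z_i · z_j), derived from the hyperspherical angle between normalized embeddings (the paper's exact formulation may differ):

```python
import numpy as np

def supcon_loss(z, labels, tau=0.30, similarity="cosine"):
    """Supervised contrastive loss over embeddings z of shape (N, D).

    similarity: "cosine" uses the dot product of L2-normalized embeddings;
    "angular" (an assumed variant) uses the negated hyperspherical angle
    -arccos(z_i . z_j), so a smaller geodesic distance means higher similarity.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    cos = np.clip(z @ z.T, -1.0, 1.0)
    sim = cos if similarity == "cosine" else -np.arccos(cos)
    logits = sim / tau

    n = len(labels)
    labels = np.asarray(labels)
    self_mask = np.eye(n, dtype=bool)
    # Denominator excludes the anchor itself.
    logits_masked = np.where(self_mask, -np.inf, logits)
    log_denom = np.log(np.sum(np.exp(logits_masked), axis=1))

    # Positives: same label, excluding self.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    losses = []
    for i in range(n):
        pos = np.where(pos_mask[i])[0]
        if len(pos) == 0:
            continue  # anchor with no positives contributes nothing
        losses.append(-np.mean(logits[i, pos] - log_denom[i]))
    return float(np.mean(losses))
```

Because the angular map compresses similarities into [-pi, 0], it changes the effective sharpness of the softmax, which is consistent with the paper's finding that similarity choice and temperature are coupled.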


Key findings
Cosine SupCon with a large delayed queue (|Q|=2048) achieved the best pooled EER (4.44%), while a queue of |Q|=4096 yielded the lowest ITW EER (8.29%). Angular similarity performed strongly without queued negatives (ITW 8.70%) but degraded with larger queues, indicating reduced reliance on large negative sets. The optimal temperature for SupCon varied significantly between cosine (τ=0.30) and angular (τ=0.07) similarities.
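A warm-started, delayed cross-batch queue of the kind studied above can be sketched minimally. The class name, FIFO storage, and delay mechanism below are assumptions for illustration, not the authors' code:

```python
import numpy as np
from collections import deque

class NegativeQueue:
    """Sketch of a warm-started global cross-batch queue.

    Embeddings and labels from past batches are stored FIFO up to `capacity`
    (e.g. |Q| = 2048 or 4096 in the paper's sweeps). Until `delay` steps have
    elapsed, get() returns nothing, so training warm-starts with plain
    in-batch SupCon before queued negatives are added to the denominator.
    """
    def __init__(self, capacity=2048, delay=2):
        self.capacity = capacity
        self.delay = delay
        self.step = 0
        self.feats = deque(maxlen=capacity)
        self.labels = deque(maxlen=capacity)

    def get(self):
        # Before the warm-up delay (or while empty), fall back to in-batch SupCon.
        if self.step < self.delay or not self.feats:
            return None, None
        return np.stack(self.feats), np.asarray(self.labels)

    def enqueue(self, z, y):
        # Store detached (no-gradient) copies of the current batch.
        for zi, yi in zip(z, y):
            self.feats.append(np.asarray(zi))
            self.labels.append(yi)
        self.step += 1
```

The queued features would be concatenated to the in-batch contrast set before computing the SupCon denominator; the observation that angular similarity degrades with larger queues suggests its loss saturates without needing this extra negative mass.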
Approach
The authors employ a two-stage pipeline with a wav2vec2 XLS-R encoder. Stage 1 fine-tunes the encoder and a projection head using a Supervised Contrastive (SupCon) objective, where they vary the similarity function (cosine vs. angular/geodesic) and scale negatives with a delayed global cross-batch queue. Stage 2 freezes the trained components and trains a linear classifier with Binary Cross-Entropy (BCE).
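Stage 2 of the pipeline reduces to a linear probe: a BCE-trained linear classifier on frozen embeddings. A minimal sketch in plain NumPy, using gradient-descent logistic regression (hyperparameters are illustrative, not from the paper):

```python
import numpy as np

def train_linear_probe(z, y, lr=0.1, epochs=200):
    """Stage-2 sketch: linear classifier trained with BCE on frozen features.

    z: (N, D) frozen encoder/projection outputs; y: (N,) 0/1 labels
    (e.g. bonafide vs. spoof). Returns weights w and bias b.
    """
    n, d = z.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(z @ w + b)))  # sigmoid scores
        grad = p - y                            # dBCE/dlogit for each sample
        w -= lr * (z.T @ grad) / n
        b -= lr * grad.mean()
    return w, b
```

Freezing the encoder in Stage 2 means any downstream gains or losses are attributable to the representation shaped by the Stage-1 SupCon variant, which is what makes the similarity/queue comparison controlled.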
Datasets
ASVspoof 2019 LA (train, dev, eval), In-the-Wild (ITW), ASVspoof 2021 DF, ASVspoof 2021 LA
Model(s)
wav2vec2 XLS-R (300M), Linear projection head, Linear classifier
Author countries
USA