A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

Authors: Hashim Ali, Nithin Sai Adupa, Surya Subramani, Hafiz Malik

Published: 2026-03-02 05:45:55+00:00

Comment: Accepted at ICASSP

AI Summary

This paper introduces Spoof-SUPERB, a new benchmark for audio deepfake detection that systematically evaluates 20 self-supervised learning (SSL) models across various architectures and pretraining objectives. The benchmark assesses performance on multiple in-domain and out-of-domain datasets and measures robustness under acoustic degradations. Results show that large-scale discriminative models like XLS-R, UniSpeech-SAT, and WavLM Large consistently achieve superior performance and resilience, benefiting from multilingual pretraining and speaker-aware objectives.

Abstract

Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluate these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.


Key findings
Large-scale discriminative SSL models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform generative and hybrid models, achieving the lowest mean Equal Error Rates (EERs). These top-performing models benefit from multilingual pretraining, speaker-aware objectives, and overall model scale. Discriminative models also demonstrate greater resilience under acoustic degradations (noise, reverberation, codec conditions), whereas generative approaches degrade sharply.
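For context, the EER used throughout is the operating point at which the false-acceptance rate (spoofed audio accepted as bona fide) equals the false-rejection rate (bona fide audio rejected). Below is a minimal sketch of a typical threshold-sweep computation; it is illustrative only and not taken from the paper's code:

```python
import numpy as np

def compute_eer(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Equal Error Rate: the threshold where false-acceptance and
    false-rejection rates cross. Higher scores mean 'more bona fide'."""
    # Sweep every observed score as a candidate threshold.
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # FRR: fraction of bona-fide utterances scored below threshold (rejected).
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # FAR: fraction of spoofed utterances scored at/above threshold (accepted).
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    # EER is approximated at the point where the two curves are closest.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)

# Toy usage with synthetic scores: well-separated classes give a low EER.
rng = np.random.default_rng(0)
eer = compute_eer(rng.normal(1.0, 0.5, 1000), rng.normal(-1.0, 0.5, 1000))
print(f"EER = {eer:.3f}")
```

Lower EER indicates better separation between bona-fide and spoofed scores: 0% is perfect detection, 50% is chance level.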
Approach
The authors introduce Spoof-SUPERB, a SUPERB-style benchmark for audio deepfake detection. It evaluates 20 pre-trained SSL models (generative, discriminative, and hybrid) by using their frozen front-ends to extract utterance-level representations, which are then passed to a lightweight, trainable fully connected classifier for binary spoof/bona-fide prediction. The models are trained on ASVspoof 2019 LA and evaluated across multiple in-domain and out-of-domain datasets, including those with acoustic degradations.
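A minimal sketch of this frozen-front-end pipeline is shown below. It assumes the Hugging Face `transformers` WavLM Large checkpoint (`microsoft/wavlm-large`) as one example front-end and mean pooling to form the utterance-level representation; the paper's actual pooling strategy and classifier hyperparameters are assumptions here, not confirmed details:

```python
import torch
import torch.nn as nn
from transformers import WavLMModel

# Frozen SSL front-end (WavLM Large as one example of the 20 models).
frontend = WavLMModel.from_pretrained("microsoft/wavlm-large")
frontend.eval()
for p in frontend.parameters():
    p.requires_grad = False  # only the classifier head is trained

# Lightweight trainable head for binary spoof / bona-fide prediction
# (layer sizes are illustrative assumptions, not the paper's values).
classifier = nn.Sequential(
    nn.Linear(frontend.config.hidden_size, 256),
    nn.ReLU(),
    nn.Linear(256, 2),  # logits: [spoof, bona fide]
)

def predict(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (batch, samples) of 16 kHz mono audio in [-1, 1]."""
    with torch.no_grad():
        # Frame-level hidden states: (batch, frames, hidden_size).
        hidden = frontend(waveform).last_hidden_state
    # Mean-pool frames into a single utterance-level representation.
    utterance_repr = hidden.mean(dim=1)
    return classifier(utterance_repr)

logits = predict(torch.randn(2, 16000))  # two 1-second dummy clips
print(logits.shape)  # torch.Size([2, 2])
```

Freezing the front-end keeps the comparison fair across SSL models: only the lightweight head, identical for every model, is trained on ASVspoof 2019 LA, so differences in detection performance reflect the learned representations rather than fine-tuning capacity.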
Datasets
ASVspoof 2019 (ASV19) LA - train, ASVspoof 2019 (ASV19) LA - eval, ASVspoof 2021 (ASV21) LA, ASVspoof 2021 (ASV21) DF, DeepfakeEval (DFEval) 2024, In-the-Wild (ITW), Famous Figures, ASVSpoof Laundered Database (ASVSpoofLD), ASVspoof 5 (ASV5) Eval
Model(s)
APC, VQ-APC, NPC, Mockingjay, TERA, DeCoAR 2.0, wav2vec, wav2vec 2.0 Base, wav2vec 2.0 Large, HuBERT Base, HuBERT Large, MR-HuBERT, XLS-R, UniSpeech-SAT, Data2Vec, WAVLABLM, WavLM Large, SSAST, MAE-AST-FRAME (with FBANK as baseline)
Author countries
USA