ASVspoof 5: Evaluation of Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech

Authors: Xin Wang, Héctor Delgado, Nicholas Evans, Xuechen Liu, Tomi Kinnunen, Hemlata Tak, Kong Aik Lee, Ivan Kukanov, Md Sahidullah, Massimiliano Todisco, Junichi Yamagishi

Published: 2026-01-07 14:01:10+00:00

Comment: Submitted

AI Summary

This paper presents an overview and analysis of the ASVspoof 5 challenge, which promotes research in speech spoofing and deepfake detection. It evaluates the solutions submitted by 53 participating teams on a new crowdsourced database featuring diverse generative speech technologies, recording conditions, and adversarial attacks. Many solutions detect spoofed speech well, but performance degrades under adversarial attacks and neural encoding, and generalization remains a persistent challenge.

Abstract

ASVspoof 5 is the fifth edition in a series of challenges which promote the study of speech spoofing and deepfake detection solutions. A significant change from previous challenge editions is a new crowdsourced database collected from a substantially greater number of speakers under diverse recording conditions, and a mix of cutting-edge and legacy generative speech technology. With the new database described elsewhere, we provide in this paper an overview of the ASVspoof 5 challenge results for the submissions of 53 participating teams. While many solutions perform well, performance degrades under adversarial attacks and the application of neural encoding/compression schemes. Together with a review of post-challenge results, we also report a study of calibration in addition to other principal challenges and outline a road-map for the future of ASVspoof.


Key findings
The challenge revealed promising detection performance, but solutions consistently struggle to generalize: performance degrades significantly under adversarial attacks, under neural encoding/compression, and on out-of-domain datasets. This suggests that current systems often overfit to specific attack characteristics and training datasets. It also points to a bottleneck in architectural innovation and a need for more principled data design, robust fusion strategies, and advanced training paradigms if systems are to generalize beyond their training conditions.
Approach
The paper analyzes submissions to the ASVspoof 5 challenge, which included two tracks: stand-alone spoof/deepfake detection and spoofing-robust automatic speaker verification (ASV). It details the challenge's evaluation setup, new crowdsourced database, and metrics, then presents an in-depth analysis of the results from participating teams, focusing on top-performing systems, the impact of various attack types and codecs, and score calibration issues.
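To make the score calibration issue concrete, the following is a minimal sketch of the log-likelihood-ratio cost (Cllr), a calibration-sensitive measure widely used in speaker verification. The summary above does not say which calibration metric the paper reports, so the choice of Cllr is an assumption, and the function name `cllr` and the toy scores are hypothetical. Scores are assumed to be natural-log likelihood ratios in which positive values favour the bona fide class.

```python
import math


def cllr(bonafide_llrs, spoof_llrs):
    """Log-likelihood-ratio cost in bits (lower is better).

    Assumption: scores are natural-log likelihood ratios where
    positive values favour the bona fide class.
    """

    def softplus(x):
        # Numerically stable log(1 + exp(x)).
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

    # Bona fide trials are penalised for low LLRs, spoof trials for high LLRs.
    bona_cost = sum(softplus(-s) for s in bonafide_llrs) / len(bonafide_llrs)
    spoof_cost = sum(softplus(s) for s in spoof_llrs) / len(spoof_llrs)

    # Average the two class costs and convert from nats to bits.
    return 0.5 * (bona_cost + spoof_cost) / math.log(2)


# Toy scores (hypothetical): both sets separate the classes perfectly,
# but the second set is shifted by a constant offset of +5.
calibrated = cllr([2.0, 3.0], [-3.0, -2.0])
shifted = cllr([7.0, 8.0], [2.0, 3.0])
print(f"calibrated: {calibrated:.3f}  shifted: {shifted:.3f}")
```

In this toy case both score sets rank every bona fide trial above every spoof trial, so a purely discrimination-based metric would not tell them apart, yet the shifted set incurs a higher Cllr. A system that outputs uninformative scores (all zeros) obtains a Cllr of 1 bit.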
Datasets
ASVspoof 5 database (derived from Multilingual LibriSpeech, MLS) and VoxCeleb2 (for ASV baseline training). Cross-dataset evaluation also used the ASVspoof 2015, ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF, and In-the-Wild (ITW) datasets.
Model(s)
RawNet2, AASIST, ECAPA-TDNN (for ASV baseline), ResNet, Transformer, ConvViT-Base, wav2vec 2.0, WavLM, GAT, MFA-Res2Net, LSTM, MLP, LCNN, GNN, Conformer.
Author countries
Japan, Spain, France, Finland, USA, Hong Kong, Singapore, India