Rapidly Adapting to New Voice Spoofing: Few-Shot Detection of Synthesized Speech Under Distribution Shifts

Authors: Ashi Garg, Zexin Cai, Henry Li Xinyuan, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews

Published: 2025-08-18 19:14:45+00:00

AI Summary

This paper proposes a self-attentive prototypical network for few-shot detection of synthesized speech, designed to adapt rapidly to new voice spoofing under distribution shifts. Using as few as 10 in-distribution samples, the method improves detection performance over traditional zero-shot detectors in conditions where distribution shifts degrade them. It achieves up to 32% relative EER reduction on Japanese-language deepfakes and 20% on the ASVspoof 2021 Deepfake dataset.

Abstract

We address the challenge of detecting synthesized speech under distribution shifts -- arising from unseen synthesis methods, speakers, languages, or audio conditions -- relative to the training data. Few-shot learning methods are a promising way to tackle distribution shifts by rapidly adapting on the basis of a few in-distribution samples. We propose a self-attentive prototypical network to enable more robust few-shot adaptation. To evaluate our approach, we systematically compare the performance of traditional zero-shot detectors and the proposed few-shot detectors, carefully controlling training conditions to introduce distribution shifts at evaluation time. In conditions where distribution shifts hamper the zero-shot performance, our proposed few-shot adaptation technique can quickly adapt using as few as 10 in-distribution samples -- achieving up to 32% relative EER reduction on Japanese-language deepfakes and 20% relative reduction on the ASVspoof 2021 Deepfake dataset.


Key findings
Few-shot adaptation improves detection under distribution shift, achieving up to 32% relative EER reduction on Japanese-language deepfakes and 20% on the ASVspoof 2021 Deepfake dataset with as few as 10 in-distribution samples. The self-attentive pooling mechanism consistently improves performance over standard mean-based prototype aggregation and prior few-shot baselines. Few-shot learning generally outperforms supervised fine-tuning in low-data conditions (N=10), indicating strong generalization under distribution shifts.
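For readers unfamiliar with the metric, a relative EER reduction compares the adapted detector's equal error rate with the zero-shot baseline as a fraction of that baseline. The snippet below is a minimal sketch of this arithmetic; the function name and the two EER values are hypothetical and chosen only for illustration, not taken from the paper.

```python
def relative_eer_reduction(baseline_eer, adapted_eer):
    """Return the EER reduction as a fraction of the baseline EER."""
    if baseline_eer <= 0:
        raise ValueError("baseline EER must be positive")
    return (baseline_eer - adapted_eer) / baseline_eer


# Made-up EER values in percent, used only to illustrate the arithmetic.
zero_shot_eer = 10.0
few_shot_eer = 6.8
print(f"{relative_eer_reduction(zero_shot_eer, few_shot_eer):.0%}")
```

With these illustrative inputs, an absolute drop of 3.2 points on a 10% baseline corresponds to a 32% relative reduction.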
Approach
The authors propose a self-attentive prototypical network that uses self-supervised learning (SSL) pre-trained speech representations. It computes more discriminative class prototypes by applying a multi-head self-attention mechanism over the support embeddings to capture inter-sample dependencies, enabling robust few-shot adaptation to novel test conditions.
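The idea of attention-based prototype aggregation can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes a single attention head with identity query/key/value projections (the paper describes a learned multi-head mechanism), mean pooling of the attended embeddings, and nearest-prototype classification by squared Euclidean distance. The two-dimensional embeddings stand in for SSL features and are invented for the example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attentive_prototype(support):
    """Attend each support embedding over all others, then mean-pool.

    Assumption: one head, identity Q/K/V projections (no learned weights).
    """
    dim = len(support[0])
    scale = math.sqrt(dim)
    attended = []
    for query in support:
        weights = softmax([dot(query, key) / scale for key in support])
        attended.append(
            [sum(w * value[j] for w, value in zip(weights, support)) for j in range(dim)]
        )
    return [sum(vec[j] for vec in attended) / len(attended) for j in range(dim)]


def classify(query, prototypes):
    """Return the label of the closest prototype (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: sqdist(query, prototypes[label]))


# Toy support set: three invented embeddings per class.
support = {
    "bonafide": [[1.0, 0.1], [0.9, 0.0], [1.1, 0.2]],
    "spoof": [[0.0, 1.0], [0.1, 0.9], [-0.1, 1.1]],
}
prototypes = {label: self_attentive_prototype(embs) for label, embs in support.items()}
print(classify([0.8, 0.1], prototypes))  # prints "bonafide"
```

Compared with a plain mean, the attention step reweights each support embedding by its similarity to the others before pooling, which is one way the prototype can reflect inter-sample dependencies.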
Datasets
ASVspoof 2019 (training), ASVspoof 2021, ShiftySpeech, In-the-Wild (ITW), CodecFake (evaluation)
Model(s)
SSL-AASIST (backbone, integrating a Wav2Vec 2.0 XLSR front-end with a spectro-temporal graph attention network back-end), Self-attentive prototypical network
Author countries
USA