Rapidly Adapting to New Voice Spoofing: Few-Shot Detection of Synthesized Speech Under Distribution Shifts


Authors: Ashi Garg, Zexin Cai, Henry Li Xinyuan, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews

Published: 2025-08-18 19:14:45+00:00

AI Summary

This paper proposes a self-attentive prototypical network for few-shot detection of synthesized speech under distribution shifts. Where such shifts degrade zero-shot detectors, the approach adapts using as few as 10 in-distribution samples, achieving up to a 32% relative EER reduction on Japanese-language deepfakes and a 20% relative reduction on the ASVspoof 2021 Deepfake dataset.

Abstract

We address the challenge of detecting synthesized speech under distribution shifts -- arising from unseen synthesis methods, speakers, languages, or audio conditions -- relative to the training data. Few-shot learning methods are a promising way to tackle distribution shifts by rapidly adapting on the basis of a few in-distribution samples. We propose a self-attentive prototypical network to enable more robust few-shot adaptation. To evaluate our approach, we systematically compare the performance of traditional zero-shot detectors and the proposed few-shot detectors, carefully controlling training conditions to introduce distribution shifts at evaluation time. In conditions where distribution shifts hamper the zero-shot performance, our proposed few-shot adaptation technique can quickly adapt using as few as 10 in-distribution samples -- achieving up to 32% relative EER reduction on deepfakes in Japanese and 20% relative reduction on the ASVspoof 2021 Deepfake dataset.
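The headline numbers are relative reductions in equal error rate (EER), the operating point where the false-acceptance and false-rejection rates coincide. The sketch below shows one simple way to estimate EER from detector scores and to compute a relative reduction. The function names, the score convention (higher means more likely bona fide), and the threshold sweep are assumptions made for illustration, not the paper's evaluation code.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumed convention: higher score = more likely bona fide;
    labels are 1 for bona fide and 0 for spoofed speech.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[labels == 0])    # spoofed samples accepted
        frr = np.mean(~accept[labels == 1])   # bona fide samples rejected
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer

def relative_reduction(baseline_eer, adapted_eer):
    """Fraction by which the adapted EER improves on the baseline EER."""
    return (baseline_eer - adapted_eer) / baseline_eer
```

As a purely hypothetical illustration, a zero-shot EER of 10.0% that drops to 6.8% after adaptation corresponds to `relative_reduction(10.0, 6.8)`, roughly 0.32, i.e. a 32% relative reduction. These two EER values are invented for the arithmetic and are not taken from the paper.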

Key findings

The proposed few-shot method significantly outperforms zero-shot detectors and other few-shot baselines under distribution shifts. The self-attentive prototype aggregation improves performance, especially in low-shot scenarios. Few-shot adaptation is superior to supervised fine-tuning with limited data, while supervised fine-tuning becomes competitive with larger datasets.

Approach

The authors propose a self-attentive prototypical network for few-shot learning. This network learns a mapping from a set of in-distribution samples to a class prototype, enabling robust adaptation to unseen synthesis methods. The self-attention mechanism improves prototype representation by considering inter-sample dependencies.
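The idea of attention-based prototype aggregation can be sketched as follows. This is a minimal NumPy illustration, not the authors' architecture: it assumes single-head scaled dot-product self-attention over the support embeddings, mean pooling of the attended vectors into a prototype, and nearest-prototype classification by squared Euclidean distance. The projection matrices `w_q`, `w_k`, and `w_v` are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attentive_prototype(support, w_q, w_k, w_v):
    """Aggregate k support embeddings of shape (k, d) into one prototype.

    Each support sample attends to every other sample, so the pooled
    prototype reflects inter-sample dependencies rather than a plain mean.
    """
    q = support @ w_q
    k = support @ w_k
    v = support @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (k, k) pairwise affinities
    attn = softmax(scores, axis=-1)           # each row sums to 1
    attended = attn @ v                       # (k, d_v) context-aware embeddings
    return attended.mean(axis=0)              # pool into a single prototype

def classify(query, prototypes):
    """Return the label of the nearest prototype (squared Euclidean distance)."""
    dists = {label: float(np.sum((query - p) ** 2))
             for label, p in prototypes.items()}
    return min(dists, key=dists.get)
```

In use, one prototype would be built per class (for example bona fide and spoofed) from the few in-distribution support samples, and each query embedding would be assigned to the closer prototype. The query must live in the same space as the prototypes, so in this sketch it would either be passed through `w_v` as well or the projections would be identity matrices.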

Datasets

ASVspoof 2019, ASVspoof 2021, ShiftySpeech, In-the-Wild (ITW), CodecFake

Model(s)

Self-attentive prototypical network using SSL-AASIST (which uses Wav2Vec 2.0 XLSR-53) as a backbone. Baselines include AASIST, anomaly detection using Mahalanobis distance, and a supervised fine-tuning approach.
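For orientation, a Mahalanobis-distance anomaly detector can be sketched in a few lines. The summary does not say which class the baseline models or how the covariance is estimated, so the choices here are assumptions: a single Gaussian fitted to a reference set of embeddings, with a small ridge term (the hypothetical `ridge` parameter) added to keep the covariance invertible.

```python
import numpy as np

def fit_gaussian(embeddings, ridge=1e-3):
    """Estimate the mean and inverse covariance of reference embeddings (n, d)."""
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    cov = cov + ridge * np.eye(embeddings.shape[1])  # regularize for stability
    return mu, np.linalg.inv(cov)

def mahalanobis(x, mu, cov_inv):
    """Distance of one embedding x from the fitted reference distribution."""
    diff = x - mu
    return float(np.sqrt(diff @ cov_inv @ diff))
```

A larger distance indicates that a sample lies further from the reference distribution, so thresholding this value gives an anomaly score without any gradient-based adaptation.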

Author countries

USA
