Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes

Authors: Neta Glazer, David Chernin, Idan Achituve, Sharon Gannot, Ethan Fetaya

Published: 2025-05-29 16:26:32+00:00

AI Summary

This paper introduces ADD-GP, a few-shot adaptive framework based on a Gaussian Process (GP) classifier for Audio Deepfake Detection (ADD). The approach combines a powerful deep embedding model (XLS-R) with the flexibility of Gaussian Processes to achieve strong performance and efficient adaptation to unseen Text-to-Speech (TTS) models with minimal data. The authors also demonstrate its applicability to personalized detection, with increased robustness to new TTS models and one-shot adaptability.

Abstract

Recent advancements in Text-to-Speech (TTS) models, particularly in voice cloning, have intensified the demand for adaptable and efficient deepfake detection methods. As TTS systems continue to evolve, detection models must be able to adapt efficiently to previously unseen generation models with minimal data. This paper introduces ADD-GP, a few-shot adaptive framework based on a Gaussian Process (GP) classifier for Audio Deepfake Detection (ADD). We show how combining a powerful deep embedding model with the flexibility of Gaussian Processes can achieve strong performance and adaptability. Additionally, we show this approach can also be used for personalized detection, with greater robustness to new TTS models and one-shot adaptability. To support our evaluation, a benchmark dataset is constructed for this task using new state-of-the-art voice cloning models.


Key findings
ADD-GP significantly outperforms baselines in few-shot adaptation to new, unseen TTS models, achieving lower Equal Error Rates (EERs) even with very few samples (e.g., 0.54% EER with 100 shots using MixPro). It exhibits better retention of performance on in-distribution TTS models compared to baselines, mitigating catastrophic forgetting. The framework also enables effective personalized deepfake detection with excellent one-shot adaptability and provides well-calibrated uncertainty estimates.
Approach
The proposed ADD-GP framework uses a Dirichlet-based Gaussian Process (GP) classifier as the back-end, operating on features extracted by XLS-R, a Wav2Vec2-based self-supervised model, as the front-end. It leverages Deep Kernel Learning (DKL), applying an RBF kernel to the XLS-R embeddings; during training, only the last block of XLS-R and the kernel parameters are updated. For few-shot adaptation, the kernel is kept fixed and the GP is updated with the new examples, optionally augmented with MixPro.
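To make the back-end concrete, here is a minimal NumPy sketch of a Dirichlet-based GP classifier with an RBF kernel over fixed embedding vectors. This is an illustration of the general technique (one-hot labels transformed into per-class regression targets via the Dirichlet/log-normal approximation, then exact GP regression per class), not the paper's implementation; all function names, shapes, and hyperparameter values are assumptions.

```python
import numpy as np

def rbf_kernel(X, Z, lengthscale=1.0, variance=1.0):
    # Squared-exponential (RBF) kernel between two sets of embeddings.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def dirichlet_targets(y, n_classes, alpha_eps=0.01):
    # Turn integer labels into per-class regression targets and
    # heteroscedastic noise variances (Dirichlet/log-normal trick).
    alpha = np.full((len(y), n_classes), alpha_eps)
    alpha[np.arange(len(y)), y] += 1.0
    sigma2 = np.log(1.0 / alpha + 1.0)   # per-point noise variance
    mu = np.log(alpha) - 0.5 * sigma2    # regression target
    return mu, sigma2

def fit_predict(X_train, y_train, X_test, n_classes=2):
    # One exact GP regression per class, then softmax over latent means.
    mu, sigma2 = dirichlet_targets(y_train, n_classes)
    K = rbf_kernel(X_train, X_train)
    Ks = rbf_kernel(X_test, X_train)
    latent = np.empty((len(X_test), n_classes))
    for c in range(n_classes):
        Kc = K + np.diag(sigma2[:, c])   # add per-class label noise
        latent[:, c] = Ks @ np.linalg.solve(Kc, mu[:, c])
    e = np.exp(latent - latent.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Few-shot adaptation in this setting amounts to appending the new examples to `X_train`/`y_train` and redoing the (cheap, closed-form) GP fit with the kernel parameters frozen, which is what makes the approach attractive with minimal data.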
Datasets
LibriFake (a new benchmark derived from LibriSpeech using various TTS models, including YourTTS, WhisperSpeech, VALL-E X, F5-TTS, and ElevenLabs), VoxCeleb.
Model(s)
XLS-R (as the deep embedding model/front-end), Gaussian Process (GP) classifier (Dirichlet-based GP with RBF kernel), AASIST (as part of SSL-AASIST baseline), Wav2Vec2 (as a backbone for baselines).
Author countries
Israel