Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes

Authors: Neta Glazer, David Chernin, Idan Achituve, Sharon Gannot, Ethan Fetaya

Published: 2025-05-29 16:26:32+00:00

AI Summary

This paper introduces ADD-GP, a few-shot adaptive framework based on a Gaussian Process (GP) classifier for Audio Deepfake Detection (ADD). The approach combines a powerful deep embedding model (XLS-R) with the flexibility of Gaussian Processes to achieve strong performance and efficient adaptation to unseen Text-to-Speech (TTS) models with minimal data. The authors also demonstrate its applicability to personalized detection, with increased robustness to new TTS models and one-shot adaptability.

Abstract

Recent advancements in Text-to-Speech (TTS) models, particularly in voice cloning, have intensified the demand for adaptable and efficient deepfake detection methods. As TTS systems continue to evolve, detection models must be able to adapt efficiently to previously unseen generation models with minimal data. This paper introduces ADD-GP, a few-shot adaptive framework based on a Gaussian Process (GP) classifier for Audio Deepfake Detection (ADD). We show how combining a powerful deep embedding model with the flexibility of Gaussian Processes can achieve strong performance and adaptability. Additionally, we show this approach can also be used for personalized detection, with greater robustness to new TTS models and one-shot adaptability. To support our evaluation, a benchmark dataset is constructed for this task using new state-of-the-art voice cloning models.


Key findings
ADD-GP significantly outperforms baselines in few-shot adaptation to new, unseen TTS models, achieving lower Equal Error Rates (EERs) even with very few samples (e.g., 0.54% EER with 100 shots using MixPro). It exhibits better retention of performance on in-distribution TTS models compared to baselines, mitigating catastrophic forgetting. The framework also enables effective personalized deepfake detection with excellent one-shot adaptability and provides well-calibrated uncertainty estimates.
Approach
The proposed ADD-GP framework uses a Dirichlet-based Gaussian Process (GP) classifier as the back-end, operating on features extracted by XLS-R, a Wav2Vec2-based self-supervised model, as the front-end. It leverages Deep Kernel Learning (DKL), applying an RBF kernel to the XLS-R embeddings; during training, only the last block of XLS-R and the kernel parameters are updated. For few-shot adaptation, the kernel is kept fixed and the GP is updated with the new examples, optionally augmented with MixPro.
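To make the back-end concrete, here is a minimal NumPy sketch of a Dirichlet-based GP classifier with an RBF kernel over fixed embedding vectors. This is an illustration of the general technique (one-hot labels transformed into per-class regression targets via the Dirichlet/log-normal approximation, then exact GP regression per class), not the paper's implementation; all function names, shapes, and hyperparameter values are assumptions.

```python
import numpy as np

def rbf_kernel(X, Z, lengthscale=1.0, variance=1.0):
    # Squared-exponential (RBF) kernel between two sets of embeddings.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def dirichlet_targets(y, n_classes, alpha_eps=0.01):
    # Turn integer labels into per-class regression targets and
    # heteroscedastic noise variances (Dirichlet/log-normal trick).
    alpha = np.full((len(y), n_classes), alpha_eps)
    alpha[np.arange(len(y)), y] += 1.0
    sigma2 = np.log(1.0 / alpha + 1.0)   # per-point noise variance
    mu = np.log(alpha) - 0.5 * sigma2    # regression target
    return mu, sigma2

def fit_predict(X_train, y_train, X_test, n_classes=2):
    # One exact GP regression per class, then softmax over latent means.
    mu, sigma2 = dirichlet_targets(y_train, n_classes)
    K = rbf_kernel(X_train, X_train)
    Ks = rbf_kernel(X_test, X_train)
    latent = np.empty((len(X_test), n_classes))
    for c in range(n_classes):
        Kc = K + np.diag(sigma2[:, c])   # add per-class label noise
        latent[:, c] = Ks @ np.linalg.solve(Kc, mu[:, c])
    e = np.exp(latent - latent.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Few-shot adaptation in this setting amounts to appending the new examples to `X_train`/`y_train` and redoing the (cheap, closed-form) GP fit with the kernel parameters frozen, which is what makes the approach attractive with minimal data.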
Datasets
LibriFake (a new benchmark derived from LibriSpeech using various TTS models, including YourTTS, WhisperSpeech, VALL-E X, F5-TTS, and ElevenLabs), VoxCeleb.
Model(s)
XLS-R (as the deep embedding model/front-end), Gaussian Process (GP) classifier (Dirichlet-based GP with RBF kernel), AASIST (as part of SSL-AASIST baseline), Wav2Vec2 (as a backbone for baselines).
Author countries
Israel