Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models

Authors: Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva

Published: 2024-05-03 15:27:11+00:00

AI Summary

This paper proposes a training-free approach to audio deepfake detection that leverages large-scale pre-trained models. The method reformulates detection as speaker verification, exposing fake audio by its mismatch with a reference set of genuine recordings of the claimed speaker. Because no fake speech is needed for training, the detector is decoupled from any specific synthesis method and generalizes to unseen ones.

Abstract

Generalization is a main issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which more and more accurate synthesis methods are developed, it is very important to design techniques that work well also on data they were not trained for. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection, with special focus on generalization ability. To this end, the detection problem is reformulated in a speaker verification framework and fake audios are exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech sample is necessary in training, cutting off any link with the generation method at the root, and ensuring full generalization ability. Features are extracted by general-purpose large pre-trained models, with no need for training or fine-tuning on specific fake detection or speaker verification datasets. At detection time only a limited set of voice fragments of the identity under test is required. Experiments on several datasets widespread in the community show that detectors based on pre-trained models achieve excellent performance and show strong generalization ability, rivaling supervised methods on in-distribution data and largely overcoming them on out-of-distribution data.


Key findings
The training-free approach using pre-trained models achieved excellent performance, rivaling supervised methods on in-distribution data and significantly outperforming them on out-of-distribution data. Among the tested backbones, BEATs showed the strongest performance and generalization, achieving the best results across all metrics.
Approach
The approach casts detection as speaker verification. Clip-level features are extracted with frozen, general-purpose pre-trained models (Wav2Vec2-XLSR, AudioCLIP, LAION-CLAP, BEATs), with no fine-tuning. The test audio's embedding is then compared against the embeddings of a reference set of genuine recordings of the claimed speaker; low similarity to the reference set flags the audio as fake, as sketched below.
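
To make the decision rule concrete, here is a minimal sketch of the verification-style comparison, assuming clip-level embeddings from a frozen pre-trained encoder (e.g. BEATs or Wav2Vec2-XLSR) are already available. The max aggregation, the threshold value, and the helper names are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-9):
    # Normalize embeddings so that a dot product equals cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def verification_score(test_embedding, reference_embeddings):
    """Similarity between the test clip and the claimed speaker's reference set.
    Embeddings are assumed to be one pooled vector per clip from a frozen
    pre-trained encoder; max aggregation is one common choice (mean is another)."""
    test = l2_normalize(np.asarray(test_embedding, dtype=np.float32))
    refs = l2_normalize(np.asarray(reference_embeddings, dtype=np.float32))
    sims = refs @ test  # cosine similarity to each genuine reference clip
    return float(sims.max())

def is_fake(test_embedding, reference_embeddings, threshold=0.5):
    # Low similarity to the claimed speaker's genuine voice -> flag as fake.
    return verification_score(test_embedding, reference_embeddings) < threshold

# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
refs = rng.normal(size=(10, 768))              # 10 genuine reference clips
test = refs[0] + 0.1 * rng.normal(size=768)    # close to the reference voice
print(is_fake(test, refs))                     # high similarity -> False
```

Because the reference set contains only genuine speech of the claimed identity, the threshold can be calibrated without ever seeing synthetic audio, which is what removes the dependence on any particular generation method.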
Datasets
ASVspoof 2019, ASVspoof 2021, In-the-Wild
Model(s)
Wav2Vec2-XLSR, AudioCLIP, LAION-CLAP, BEATs
Author countries
Italy