Where are we in audio deepfake detection? A systematic analysis over generative and detection models

Authors: Xiang Li, Pin-Yu Chen, Wenqi Wei

Published: 2024-10-06 01:03:42+00:00

AI Summary

This paper introduces SONAR, a framework for benchmarking AI-synthesized audio detection models. SONAR uses a novel dataset from 9 diverse audio synthesis platforms and evaluates both traditional and foundation model-based detection systems, revealing that foundation models show stronger generalization capabilities.

Abstract

Recent advances in Text-to-Speech (TTS) and Voice-Conversion (VC) using generative Artificial Intelligence (AI) technology have made it possible to generate high-quality and realistic human-like audio. This poses growing challenges in distinguishing AI-synthesized speech from the genuine human voice and could raise concerns about misuse for impersonation, fraud, spreading misinformation, and scams. However, existing detection methods for AI-synthesized audio have not kept pace and often fail to generalize across diverse datasets. In this paper, we introduce SONAR, a synthetic AI-Audio Detection Framework and Benchmark, aiming to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. SONAR includes a novel evaluation dataset sourced from 9 diverse audio synthesis platforms, including leading TTS providers and state-of-the-art TTS models. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems. Through extensive experiments, (1) we reveal the limitations of existing detection methods and demonstrate that foundation models exhibit stronger generalization capabilities, likely due to their model size and the scale and quality of pretraining data. (2) Speech foundation models demonstrate robust cross-lingual generalization capabilities, maintaining strong performance across diverse languages despite being fine-tuned solely on English speech data. This finding also suggests that the primary challenges in audio deepfake detection are more closely tied to the realism and quality of synthetic audio rather than language-specific characteristics. (3) We explore the effectiveness and efficiency of few-shot fine-tuning in improving generalization, highlighting its potential for tailored applications, such as personalized detection systems for specific entities or individuals.


Key findings
Foundation models demonstrated significantly stronger generalization capabilities than traditional models across various datasets and languages, likely due to their size and pretraining data. Few-shot fine-tuning effectively improved performance on specific datasets but also highlighted the challenge of catastrophic forgetting. The primary challenges in audio deepfake detection appear to be tied to the realism and quality of synthetic audio rather than to language-specific characteristics.
Approach
The authors created SONAR, a benchmark framework built around a new evaluation dataset drawn from 9 diverse audio synthesis platforms. They evaluated 11 state-of-the-art models (5 traditional and 6 foundation models) on this dataset and several existing datasets to analyze generalization across datasets and languages. Few-shot fine-tuning was also explored as a way to improve generalization.
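
The few-shot fine-tuning step can be illustrated with a minimal sketch: adapting a speech foundation model such as wav2vec 2.0 into a binary real-vs-synthetic classifier using a small number of labeled clips. The backbone checkpoint, hyperparameters, and data loading below are assumptions for illustration, not the authors' exact SONAR configuration.

```python
# Hypothetical few-shot fine-tuning sketch for audio deepfake detection.
# Assumes 16 kHz mono waveforms and a small labeled set (0 = real, 1 = synthetic).
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

model_name = "facebook/wav2vec2-base"  # assumed backbone, not the paper's exact choice
extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name, num_labels=2)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def few_shot_fine_tune(batches, epochs=3):
    """batches yields (list_of_waveform_arrays, label_tensor) pairs."""
    model.train()
    for _ in range(epochs):
        for waveforms, labels in batches:
            inputs = extractor(waveforms, sampling_rate=16_000,
                               return_tensors="pt", padding=True)
            out = model(**inputs, labels=labels)  # cross-entropy loss computed internally
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```

Freezing most backbone layers or mixing in samples from the original training data are common ways to limit the catastrophic forgetting noted in the key findings, though the specific mitigation used here would follow the paper's setup.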
Datasets
A novel dataset from 9 audio synthesis platforms (including OpenAI, xTTS, AudioGen, Seed-TTS, VALL-E, PromptTTS2, NaturalSpeech3, VoiceBox, FlashSpeech); WaveFake; LibriSeVoc; In-the-Wild; MLAAD; ASVspoof 2019 (subset). LibriTTS was used for real audio samples.
Model(s)
AASIST, RawGAT-ST, RawNet2, Spectrogram+ResNet, LFCC-LCNN, Wav2Vec2, Wav2Vec2-BERT, HuBERT, CLAP, Whisper-small, Whisper-large, Whisper-tiny, Whisper-base, Whisper-medium
Author countries
USA