Harder or Different? Understanding Generalization of Audio Deepfake Detection


Authors: Nicolas M. Müller, Nicholas Evans, Hemlata Tak, Philip Sperl, Konstantin Böttinger

Published: 2024-06-05 10:33:15+00:00

AI Summary

This research investigates the generalization problem in audio deepfake detection: is poor performance on unseen deepfakes caused by their being harder to detect ('hardness'), or by fundamental differences between deepfake generation methods ('difference')? The study attributes the performance gap primarily to 'difference', which suggests that simply increasing model capacity may not be enough for robust generalization.

Abstract

Recent research has highlighted a key issue in speech deepfake detection: models trained on one set of deepfakes perform poorly on others. The question arises: is this due to the continuously improving quality of Text-to-Speech (TTS) models, i.e., are newer deepfakes just 'harder' to detect? Or, is it because deepfakes generated with one model are fundamentally different to those generated using another model? We answer this question by decomposing the performance gap between in-domain and out-of-domain test data into 'hardness' and 'difference' components. Experiments performed using ASVspoof databases indicate that the hardness component is practically negligible, with the performance gap being attributed primarily to the difference component. This has direct implications for real-world deepfake detection, highlighting that merely increasing model capacity, the currently-dominant research trend, may not effectively address the generalization challenge.

Key findings

Most of the performance gap in audio deepfake detection stems from the 'difference' between training and test data, rather than from newer deepfakes being harder to detect; the 'hardness' component is practically negligible. Increasing model capacity alone may therefore not solve the generalization problem. The 'difference' gap is particularly pronounced when dealing with unseen deepfake generation methods.

Approach

The authors decompose the performance gap between in-domain and out-of-domain audio deepfake detection into 'hardness' and 'difference' components. They use various deepfake detection models trained on ASVspoof 2019 and tested on ASVspoof 2021 and In-the-Wild datasets to quantify these components.
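
To make the idea of a two-part gap concrete, here is a minimal sketch of one possible additive decomposition. It is an assumption for illustration, not the authors' published formula: this summary does not state how 'hardness' and 'difference' are computed. The sketch assumes three error rates (for example, equal error rates) are available: a model trained and tested on the source data, a model trained and tested on the target data, and the source-trained model tested on the target data. The names `decompose_gap` and `GapDecomposition` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapDecomposition:
    total: float       # out-of-domain error minus in-domain error
    hardness: float    # part explained by the target data being harder in itself
    difference: float  # part explained by the train/test mismatch


def decompose_gap(err_src_on_src: float,
                  err_tgt_on_tgt: float,
                  err_src_on_tgt: float) -> GapDecomposition:
    """Split a generalization gap into two additive parts.

    ASSUMPTION (not taken from the paper): 'hardness' is measured as how much
    worse an in-domain model does on the target data than an in-domain model
    does on the source data; 'difference' is whatever remains of the gap.
    """
    total = err_src_on_tgt - err_src_on_src
    hardness = err_tgt_on_tgt - err_src_on_src
    difference = err_src_on_tgt - err_tgt_on_tgt
    return GapDecomposition(total, hardness, difference)


if __name__ == "__main__":
    # Made-up error rates for illustration only; these are NOT results
    # from the paper.
    result = decompose_gap(err_src_on_src=0.02,
                           err_tgt_on_tgt=0.03,
                           err_src_on_tgt=0.20)
    print(result)
```

Under this assumed definition the two parts always sum to the total gap, so a small 'hardness' value alongside a large total gap points to train/test mismatch rather than to intrinsically harder target data.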

Datasets

ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF, In-the-Wild

Model(s)

LCNN, RawNet2, Whisper-DF, SSL-W2V2

Author countries

Germany, France, USA
