Are audio DeepFake detection models polyglots?

Authors: Bartłomiej Marek, Piotr Kawa, Piotr Syga

Published: 2024-12-23 19:32:53+00:00

AI Summary

This research benchmarks multilingual audio deepfake detection by evaluating various adaptation strategies. Experiments analyzing models trained on English datasets, along with intra- and cross-linguistic adaptations, reveal significant variations in detection efficacy, highlighting the importance of target-language data.

Abstract

Since the majority of audio DeepFake (DF) detection methods are trained on English-centric datasets, their applicability to non-English languages remains largely unexplored. In this work, we present a benchmark for the multilingual audio DF detection challenge by evaluating various adaptation strategies. Our experiments focus on analyzing models trained on English benchmark datasets, as well as intra-linguistic (same-language) and cross-linguistic adaptation approaches. Our results indicate considerable variations in detection efficacy, highlighting the difficulties of multilingual settings. We show that limiting the training data to English negatively impacts detection efficacy, underscoring the importance of data in the target language.


Key findings
English-only trained models show varying performance across languages, with some non-English languages exhibiting better detection than English. Intra-linguistic adaptation (fine-tuning with target-language data) significantly improves detection accuracy. Cross-linguistic adaptation yields mixed results, with performance sometimes worse than using English-trained models directly.
Approach
The study evaluates three adaptation strategies: using English-trained models directly on non-English audio, training models from scratch on limited target-language data, and fine-tuning English-trained models with target-language data. Performance is assessed across multiple languages from different language families.
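This summary does not state the evaluation metric, but benchmarks built on ASVspoof-style anti-spoofing data (such as ASVspoof2019 LA, used here) conventionally report the equal error rate (EER): the operating point where the false acceptance rate equals the false rejection rate. As an illustrative sketch only, not the authors' code, a threshold-sweep EER estimate can be computed like this:

```python
def equal_error_rate(scores, labels):
    """Approximate the equal error rate (EER) by sweeping thresholds.

    scores: detector outputs, higher = more likely bona fide.
    labels: 1 = bona fide, 0 = spoof/DeepFake.
    Returns the EER estimate at the threshold where |FAR - FRR| is smallest.
    """
    pairs = sorted(zip(scores, labels))       # sweep a threshold over sorted scores
    n_pos = sum(labels)                       # number of bona-fide samples
    n_neg = len(labels) - n_pos               # number of spoofed samples
    best_gap, best_eer = 1.0, 1.0
    for i in range(len(pairs) + 1):
        # Threshold placed after the first i samples: accept everything above it.
        frr = sum(1 for _, y in pairs[:i] if y == 1) / n_pos  # bona fide rejected
        far = sum(1 for _, y in pairs[i:] if y == 0) / n_neg  # spoofs accepted
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```

A perfectly separating detector yields an EER of 0.0, while chance-level scores yield roughly 0.5; lower is better, which is why an English-trained model's EER rising on an unseen language signals the adaptation gap the paper measures.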
Datasets
ASVspoof2019 LA (English benchmark dataset), M-AILABS Speech Dataset, Multi-Language Audio Anti-Spoofing Dataset (MLAAD)
Model(s)
W2V+AASIST, LFCC+AASIST, LFCC+MesoNet, RawGAT-ST, Whisper+AASIST
Author countries
Poland, Germany