Multilingual Source Tracing of Speech Deepfakes: A First Benchmark

Authors: Xi Xuan, Yang Xiao, Rohan Kumar Das, Tomi Kinnunen

Published: 2025-08-06 07:11:36+00:00

AI Summary

This paper introduces the first benchmark for multilingual speech deepfake source tracing: identifying which generation model produced a deepfake utterance, across different languages and speakers. The benchmark pairs a new dataset with evaluation protocols that assess model generalization in both mono- and cross-lingual scenarios.

Abstract

Recent progress in generative AI has made it increasingly easy to create natural-sounding deepfake speech from just a few seconds of audio. While these tools support helpful applications, they also raise serious concerns by making it possible to generate convincing fake speech in many languages. Current research has largely focused on detecting fake speech, but little attention has been given to tracing the source models used to generate it. This paper introduces the first benchmark for multilingual speech deepfake source tracing, covering both mono- and cross-lingual scenarios. We comparatively investigate DSP- and SSL-based modeling; examine how SSL representations fine-tuned on different languages impact cross-lingual generalization performance; and evaluate generalization to unseen languages and speakers. Our findings offer the first comprehensive insights into the challenges of identifying speech generation models when training and inference languages differ. The dataset, protocol and code are available at https://github.com/xuanxixi/Multilingual-Source-Tracing.


Key findings
SSL models fine-tuned on the target language achieved the best monolingual performance, while LFCC features paired with an ECAPA-TDNN backend generalized best in cross-lingual settings. Cross-lingual performance tended to be higher within the same language family, though it varied substantially across language pairs.
Approach
The authors compare DSP- and SSL-based models for source tracing, evaluating them under several experimental protocols that cover mono- and cross-lingual scenarios as well as unseen languages and speakers. Performance is reported as Macro-F1.
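Macro-F1 averages per-class F1 scores with equal weight, so each source model contributes equally regardless of how many test utterances it accounts for. A minimal sketch of this evaluation, with hypothetical labels for the four TTS architectures (not taken from the authors' code):

```python
# Minimal sketch of Macro-F1 scoring for source-model classification.
# The label values below are hypothetical, one integer per TTS architecture.
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 3, 0, 1, 2, 3]   # ground-truth source model per utterance
y_pred = [0, 1, 2, 2, 0, 1, 3, 3]   # predicted source model per utterance

# "macro" averaging computes F1 per class and then takes the unweighted mean,
# so rare source models count as much as common ones.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Macro-F1: {macro_f1:.3f}")
```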
Datasets
MCL-MLAAD (a refined version of the MLAAD dataset), containing synthetic speech in six languages (English, German, French, Italian, Polish, Russian) generated by four TTS architectures, with noise perturbations added.
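To illustrate how a cross-lingual protocol over such a dataset can be organized, the sketch below trains on one language and evaluates on another; the metadata format, directory layout, and helper name are assumptions for illustration, not the released protocol files.

```python
# Hypothetical illustration of a cross-lingual source-tracing split:
# train on one language, evaluate on a different one.
from collections import defaultdict

metadata = [
    # (language, source TTS model, audio path) -- toy entries, not real paths
    ("en", "tts_a", "audio/en/0001.wav"),
    ("en", "tts_b", "audio/en/0002.wav"),
    ("de", "tts_a", "audio/de/0001.wav"),
    ("de", "tts_c", "audio/de/0002.wav"),
]

def split_by_language(entries, train_lang, test_lang):
    """Return (train, test) lists of (path, source_model) for a
    cross-lingual protocol."""
    by_lang = defaultdict(list)
    for lang, model, path in entries:
        by_lang[lang].append((path, model))
    return by_lang[train_lang], by_lang[test_lang]

train_set, test_set = split_by_language(metadata, train_lang="en", test_lang="de")
```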
Model(s)
DSP-based models: LFCC-ResNet18, LFCC-AASIST, LFCC-ECAPA-TDNN. SSL-based models: wav2vec 2.0 Large (LV-60), XLS-R-300M, and six language-specific fine-tuned variants of XLS-R.
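Both front-end families can be prototyped with standard libraries. The sketch below uses torchaudio's LFCC transform and the public facebook/wav2vec2-xls-r-300m checkpoint; the parameter values are illustrative defaults, not the paper's exact configuration.

```python
# Sketch of the two front-end families: a DSP feature (LFCC) and an SSL
# representation (XLS-R-300M). Settings here are illustrative, not the
# authors' configuration.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

waveform = torch.randn(1, 16000)  # one second of dummy 16 kHz audio

# DSP front-end: Linear Frequency Cepstral Coefficients.
lfcc = torchaudio.transforms.LFCC(sample_rate=16000, n_lfcc=20)(waveform)
# lfcc has shape (1, n_lfcc, frames) and would feed a ResNet18 / AASIST /
# ECAPA-TDNN classification backend.

# SSL front-end: XLS-R-300M hidden states (a language-specific fine-tuned
# variant could be substituted here).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-300m")
ssl_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
inputs = extractor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = ssl_model(**inputs).last_hidden_state  # (1, frames, hidden_dim)
```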
Author countries
Finland, Hong Kong SAR, Australia, Singapore