MLAAD: The Multi-Language Audio Anti-Spoofing Dataset

Authors: Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, Edresson Casanova, Eren Gölge, Thorsten Müller, Piotr Syga, Philip Sperl, Konstantin Böttinger

Published: 2024-01-17 15:09:02+00:00

AI Summary

The paper introduces MLAAD v7, a multi-language audio anti-spoofing dataset containing 485.3 hours of synthetic speech in 40 languages, generated with 101 TTS models spanning 52 architectures. Experiments show that models trained on MLAAD outperform those trained on comparable datasets such as InTheWild and Fake-Or-Real, and complement ASVspoof 2019, demonstrating MLAAD's value as a training resource for robust, multilingual audio deepfake detection.

Abstract

Text-to-Speech (TTS) technology offers notable benefits, such as providing a voice for individuals with speech impairments, but it also facilitates the creation of audio deepfakes and spoofing attacks. AI-based detection methods can help mitigate these risks; however, the performance of such models is inherently dependent on the quality and diversity of their training data. Presently, the available datasets are heavily skewed towards English and Chinese audio, which limits the global applicability of these anti-spoofing systems. To address this limitation, this paper presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), version 7, created using 101 TTS models, comprising 52 different architectures, to generate 485.3 hours of synthetic voice in 40 different languages. We train and evaluate three state-of-the-art deepfake detection models with MLAAD and observe that it demonstrates superior performance over comparable datasets like InTheWild and Fake-Or-Real when used as a training resource. Moreover, compared to the renowned ASVspoof 2019 dataset, MLAAD proves to be a complementary resource. In tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperformed each other, each excelling on four datasets. By publishing MLAAD and making a trained model accessible via an interactive webserver, we aim to democratize anti-spoofing technology, making it accessible beyond the realm of specialists, and contributing to global efforts against audio spoofing and deepfakes.
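This summary does not spell out the generation pipeline, but many open multilingual TTS models are distributed through the open-source Coqui TTS toolkit, which makes for a convenient illustration of how a single spoofed utterance is produced. The checkpoint and text below are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal sketch: synthesising one spoofed utterance with the open-source
# Coqui TTS toolkit (pip install TTS). The chosen checkpoint is an
# illustrative assumption; the paper's 101 models are catalogued in MLAAD itself.
from TTS.api import TTS

tts = TTS(model_name="tts_models/de/thorsten/vits")  # a German VITS voice
tts.tts_to_file(
    text="Dies ist ein synthetischer Beispielsatz.",  # "This is a synthetic example sentence."
    file_path="mlaad_style_spoof.wav",
)
```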


Key findings
Models trained on MLAAD generalize well across datasets. MLAAD and ASVspoof 2019 proved complementary: across eight evaluation datasets, each outperformed the other on four. Together, the results indicate that MLAAD is a valuable resource for improving the robustness and global applicability of audio deepfake detection systems.
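Results of this kind are conventionally reported as equal error rate (EER), the operating point where false-acceptance and false-rejection rates coincide. A minimal sketch of computing it from detector scores (the metric is the field's standard convention; the paper's own scoring code is not shown in this summary):

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Equal error rate: the operating point where the false-positive
    rate equals the false-negative rate (1 - true-positive rate)."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Toy usage: 1 = spoofed, 0 = bona fide; higher score = more likely spoofed.
labels = np.array([0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
print(f"EER = {compute_eer(labels, scores):.2%}")
```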
Approach
The authors built MLAAD with 101 TTS models, covering 52 architectures, generating 485.3 hours of synthetic speech across 40 languages. They then trained three state-of-the-art deepfake detection models on MLAAD and on comparable datasets, and evaluated each across eight test sets to compare in-domain performance and cross-dataset generalization, as sketched below.
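The implied cross-dataset protocol looks roughly like the grid below: train on one corpus, score every evaluation corpus, and report EER per pair. Here `load_dataset`, `train_detector`, and `score` are hypothetical placeholders standing in for the paper's actual training code, and `compute_eer` is the helper sketched above:

```python
# Hypothetical cross-dataset grid; load_dataset(), train_detector() and
# score() are placeholders for whichever detector (e.g. RawGAT-ST) is used.
TRAIN_SETS = ["MLAAD", "ASVspoof2019"]
EVAL_SETS = ["ASVspoof2021-DF", "InTheWild", "FakeOrReal", "WaveFake"]

for train_name in TRAIN_SETS:
    detector = train_detector(load_dataset(train_name, split="train"))
    for eval_name in EVAL_SETS:
        labels, scores = score(detector, load_dataset(eval_name, split="eval"))
        print(f"{train_name} -> {eval_name}: EER = {compute_eer(labels, scores):.2%}")
```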
Datasets
MLAAD (created by the authors), ASVspoof 2019, ASVspoof 2021-DF, ASVspoof 2021-LA, FakeOrReal, InTheWild, Voc.v, WaveFake, RIRS Noises, ESC-50 (noise), Free Music Archive (instrumental music), MUSAN, M-AILABS Speech Dataset
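Several of the listed corpora (RIRS Noises, ESC-50, Free Music Archive, MUSAN) are commonly used as augmentation sources rather than spoofing benchmarks, which is presumably their role here. A minimal torchaudio sketch of that style of augmentation follows; the SNR range is an illustrative assumption, and the noise clip is assumed to be at least as long as the speech:

```python
import random
import torch
import torchaudio.functional as F

def augment(speech, noise, rir, snr_db=(5.0, 20.0)):
    """Reverberate a (channels, time) waveform with a room impulse
    response, then add noise at a random SNR -- a common recipe with
    RIRS Noises and MUSAN/ESC-50 style corpora."""
    rir = rir / torch.linalg.vector_norm(rir)              # normalise the RIR
    speech = F.fftconvolve(speech, rir)[..., : speech.shape[-1]]
    noise = noise[..., : speech.shape[-1]]                 # assumes noise >= speech length
    snr = torch.tensor([random.uniform(*snr_db)])
    return F.add_noise(speech, noise, snr)
```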
Model(s)
RawGAT-ST, SSL-W2V2, WhisperDF
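SSL-W2V2 presumably denotes a detector built on a self-supervised wav2vec 2.0 encoder with a lightweight classification head. A sketch of extracting such features with HuggingFace transformers, where the XLS-R checkpoint is an assumption rather than necessarily the paper's exact backbone:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint: an XLS-R style wav2vec 2.0 encoder; anti-spoofing
# detectors typically put a small classifier on top of these features.
name = "facebook/wav2vec2-xls-r-300m"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
encoder = Wav2Vec2Model.from_pretrained(name).eval()

waveform = torch.randn(16000)  # 1 second of 16 kHz audio as a stand-in
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = encoder(**inputs).last_hidden_state  # (1, frames, 1024)
```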
Author countries
Germany, Poland