Exploring WavLM Back-ends for Speech Spoofing and Deepfake Detection

Authors: Theophile Stourbe, Victor Miara, Theo Lepage, Reda Dehak

Published: 2024-09-08 08:54:36+00:00

AI Summary

This paper presents a speech deepfake detection system that leverages a pre-trained WavLM model as a front-end and explores different back-end techniques for aggregating its representations. The system achieves state-of-the-art results on the ASVspoof 5 (2024) challenge, demonstrating the effectiveness of this approach.

Abstract

This paper describes our submitted systems to the ASVspoof 5 Challenge Track 1: Speech Deepfake Detection - Open Condition, which consists of a stand-alone speech deepfake (bonafide vs spoof) detection task. Recently, large-scale self-supervised models have become standard in Automatic Speech Recognition (ASR) and other speech processing tasks. Thus, we leverage a pre-trained WavLM as a front-end model and pool its representations with different back-end techniques. The complete framework is fine-tuned using only the training dataset of the challenge, similar to the closed condition. In addition, we adopt data augmentation by adding noise and reverberation using the MUSAN and RIR datasets. We also experiment with codec augmentations to increase the performance of our method. Finally, we use the Bosaris toolkit for score calibration and system fusion to obtain better Cllr scores. Our fused system achieves 0.0937 minDCF, 3.42% EER, 0.1927 Cllr, and 0.1375 actDCF.


Key findings
The proposed system achieved a 0.0937 minDCF and 3.42% EER on the ASVspoof 5 evaluation dataset. Data augmentation significantly improved performance, and fusing systems that use different back-end techniques further enhanced results. The A28 condition (YourTTS-generated speech) proved to be the most challenging.
Approach
The authors utilize a pre-trained WavLM model to extract features from speech audio. These features are then aggregated using different back-end techniques (Weighted Average and Multi-Head Factorized Attention) before being fed into a classifier for deepfake detection. Data augmentation techniques are employed to improve robustness.
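The Weighted Average back-end mentioned above can be sketched as a softmax-weighted sum over the front-end's layer outputs, followed by mean pooling over time. The snippet below is a minimal NumPy illustration under the assumption of fixed-shape layer outputs; the function name and shapes are illustrative, not the authors' code (the paper's actual system fine-tunes learnable layer weights jointly with the classifier).

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - np.max(x))
    return e / e.sum()

def weighted_average_pool(hidden_states, layer_logits):
    """Aggregate per-layer front-end representations into one embedding.

    hidden_states: (num_layers, T, D) frame-level features, one slice per
        transformer layer (e.g. from a WavLM forward pass).
    layer_logits: (num_layers,) unnormalized learnable layer weights.
    Returns a (D,) utterance-level embedding: softmax-weighted sum over
    layers, then mean over the T time frames.
    """
    w = softmax(layer_logits)                       # (num_layers,)
    fused = np.tensordot(w, hidden_states, axes=1)  # (T, D)
    return fused.mean(axis=0)                       # (D,)

# toy example: 3 layers, 4 frames, 2-dim features
rng = np.random.default_rng(0)
h = rng.standard_normal((3, 4, 2))
emb = weighted_average_pool(h, np.zeros(3))  # zero logits = equal weights
```

With zero logits the layer weights are uniform, so the result reduces to a plain average over layers and frames; training would instead learn which layers carry the most spoofing-relevant information.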
Datasets
ASVspoof 5 Challenge Track 1 dataset (training and development sets), MUSAN noise dataset, RIR dataset, LibriSpeech dataset (for WavLM pre-training)
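The MUSAN and RIR augmentations amount to mixing noise at a target SNR and convolving the waveform with a room impulse response. The sketch below is an illustrative NumPy version of these two operations, not the authors' exact pipeline; function names and the peak-normalization choice are assumptions.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise clip (e.g. from MUSAN) into speech at a target SNR (dB)."""
    noise = noise[: len(speech)]           # truncate noise to speech length
    sp = np.mean(speech ** 2)              # speech power
    npow = np.mean(noise ** 2) + 1e-12     # noise power (guard against zero)
    # scale so that 10*log10(sp / scaled-noise-power) == snr_db
    scale = np.sqrt(sp / (npow * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Simulate reverberation by convolving with a room impulse response."""
    wet = np.convolve(speech, rir)[: len(speech)]
    # rescale to the dry signal's peak to avoid level blow-up
    return wet * (np.max(np.abs(speech)) / (np.max(np.abs(wet)) + 1e-12))
```

In practice such augmentations are applied on the fly during fine-tuning, with SNRs and impulse responses sampled randomly per utterance.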
Model(s)
Pre-trained WavLM Base model, ResNet-34 (baseline), Weighted Average pooling, Multi-Head Factorized Attention pooling
Author countries
France