WavLM model ensemble for audio deepfake detection

Authors: David Combei, Adriana Stan, Dan Oneata, Horia Cucu

Published: 2024-08-14 09:43:35+00:00

AI Summary

This paper presents a method for audio deepfake detection using an ensemble of WavLM models. The approach benchmarks ten pretrained representations, finds WavLM to be the strongest, and then finetunes WavLM models on extended data with augmentation, reaching equal error rates (EER) of 6.56% and 17.08% on the two ASVspoof5 evaluation sets.

Abstract

Audio deepfake detection has become a pivotal task over the last couple of years, as many recent speech synthesis and voice cloning systems generate highly realistic speech samples, thus enabling their use in malicious activities. In this paper we address the issue of audio deepfake detection as it was set in the ASVspoof5 challenge. First, we benchmark ten types of pretrained representations and show that the self-supervised representations stemming from the wav2vec2 and wavLM families perform best. Of the two, wavLM is better when restricting the pretraining data to LibriSpeech, as required by the challenge rules. To further improve performance, we finetune the wavLM model for the deepfake detection task. We extend the ASVspoof5 dataset with samples from other deepfake detection datasets and apply data augmentation. Our final challenge submission consists of a late fusion combination of four models and achieves an equal error rate of 6.56% and 17.08% on the two evaluation sets.
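A minimal sketch of the benchmarking step, assuming the Hugging Face transformers library and a public WavLM checkpoint: a frozen WavLM model produces frame-level features that are mean-pooled into one embedding per utterance, which a lightweight downstream classifier would then score. The checkpoint name and pooling are illustrative choices, not the authors' exact configuration.

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

# Assumed checkpoint; the challenge setup restricts pretraining data to LibriSpeech.
MODEL_NAME = "microsoft/wavlm-base"

feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)
wavlm = WavLMModel.from_pretrained(MODEL_NAME).eval()

def utterance_embedding(waveform, sample_rate=16_000):
    """Return one fixed-size embedding per utterance by mean-pooling WavLM frames."""
    inputs = feature_extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        frames = wavlm(**inputs).last_hidden_state   # shape: (1, n_frames, hidden_dim)
    return frames.mean(dim=1).squeeze(0)             # shape: (hidden_dim,)

# Example with a 2-second placeholder waveform instead of a real audio file.
embedding = utterance_embedding(torch.randn(32_000).numpy())
print(embedding.shape)   # torch.Size([768]) for the base checkpoint
```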


Key findings
WavLM models consistently outperformed the other benchmarked representations. Finetuning with data augmentation significantly improved performance. A late fusion ensemble of four WavLM models achieved EERs of 6.56% and 17.08% on the two ASVspoof5 evaluation sets.
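Since the headline result is reported as an equal error rate, a short sketch of how EER is computed from detection scores may help; the scores and labels below are made up for illustration and are not taken from the paper.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the operating point where the false acceptance rate (spoofs accepted)
    equals the false rejection rate (bonafide rejected).
    labels: 1 = bonafide, 0 = spoof; higher score = more bonafide-like."""
    thresholds = np.sort(np.unique(scores))
    far, frr = [], []
    for t in thresholds:
        accept = scores >= t
        far.append(np.mean(accept[labels == 0]))     # spoofs wrongly accepted
        frr.append(np.mean(~accept[labels == 1]))    # bonafide wrongly rejected
    far, frr = np.array(far), np.array(frr)
    idx = np.argmin(np.abs(far - frr))               # closest crossing point
    return float((far[idx] + frr[idx]) / 2)

# Illustrative scores for six utterances.
scores = np.array([0.92, 0.81, 0.30, 0.75, 0.20, 0.10])
labels = np.array([1,    1,    0,    0,    0,    1])
print(f"EER = {equal_error_rate(scores, labels):.2%}")
```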
Approach
The authors benchmark ten pretrained audio representations for deepfake detection and find WavLM to perform best under the challenge's LibriSpeech-only pretraining constraint. They then finetune WavLM on ASVspoof5 data extended with samples from other deepfake datasets and processed with data augmentation, and finally combine four finetuned and pretrained WavLM models through score-level late fusion, as sketched below.
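A minimal sketch of the score-level late fusion step: per-utterance scores from the four detectors are averaged before the final decision. Uniform weights are an assumption here; the summary does not state the exact combination rule.

```python
import numpy as np

def late_fusion(model_scores, weights=None):
    """Fuse per-utterance scores from several detectors by (weighted) averaging."""
    stacked = np.stack(model_scores)                 # shape: (n_models, n_utterances)
    if weights is None:
        weights = np.full(stacked.shape[0], 1.0 / stacked.shape[0])
    return np.average(stacked, axis=0, weights=weights)

# Four hypothetical WavLM-based detectors scoring three utterances.
fused = late_fusion([
    np.array([0.91, 0.12, 0.55]),
    np.array([0.88, 0.20, 0.47]),
    np.array([0.95, 0.05, 0.60]),
    np.array([0.90, 0.15, 0.52]),
])
print(fused)   # fused scores, thresholded downstream for the final decision
```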
Datasets
ASVspoof5, ASVspoof 2019, ASVspoof 2021, Fake or Real (FoR), In the Wild (ITW), LibriSpeech
Model(s)
WavLM, wav2vec2, HuBERT, DeCoAR2, DistilHuBERT, BEATs, ECAPA-TDNN, TitaNet, LEAF, HEAR’s YAMNet, WavLM-large, wav2vec2-xls-r-2b (not used in challenge)
Author countries
Romania