Unmasking Deepfakes: Leveraging Augmentations and Features Variability for Deepfake Speech Detection

Authors: Inbal Rimon, Oren Gal, Haim Permuter

Published: 2025-01-09 19:31:10+00:00

AI Summary

This paper presents a hybrid deepfake speech detection model that combines a self-supervised feature extractor (Wav2Vec 2.0) with a ResNet34 classifier. The model incorporates novel audio-level and feature-level augmentations and achieves state-of-the-art results in Track 1 (closed conditions) of the ASVSpoof5 challenge.
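
For concreteness, below is a minimal PyTorch sketch of such a hybrid pipeline: a pretrained Wav2Vec 2.0 encoder produces frame-level features that are treated as a single-channel map and classified by a ResNet34 head. The checkpoint name, single-channel stem adaptation, and shapes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34
from transformers import Wav2Vec2Model

class HybridDetector(nn.Module):
    """Sketch: self-supervised Wav2Vec 2.0 encoder + ResNet34 classifier head."""

    def __init__(self, ssl_name: str = "facebook/wav2vec2-base"):  # assumed checkpoint
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(ssl_name)
        self.head = resnet34(num_classes=2)  # bona fide vs. spoof logits
        # ResNet34 expects 3-channel images; adapt the stem so the
        # (time, feature) map can be fed as a single channel.
        self.head.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                    padding=3, bias=False)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of raw 16 kHz audio
        feats = self.encoder(waveform).last_hidden_state  # (B, T, 768)
        return self.head(feats.unsqueeze(1))              # (B, 1, T, 768) -> (B, 2)
```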

Abstract

The detection of deepfake speech has become increasingly challenging with the rapid evolution of deepfake technologies. In this paper, we propose a hybrid architecture for deepfake speech detection, combining a self-supervised learning framework for feature extraction with a classifier head to form an end-to-end model. Our approach incorporates both audio-level and feature-level augmentation techniques. Specifically, we introduce and analyze various masking strategies for augmenting raw audio spectrograms and for enhancing feature representations during training. We incorporate compression augmentations during the pretraining phase of the feature extractor to address the limitations of small, single-language datasets. We evaluate the model on the ASVSpoof5 (ASVSpoof 2024) challenge, achieving state-of-the-art results in Track 1 under closed conditions with an Equal Error Rate of 4.37%. By employing different pretrained feature extractors, the model achieves an enhanced EER of 3.39%. Our model demonstrates robust performance against unseen deepfake attacks and exhibits strong generalization across different codecs.
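
The paper's exact masking strategies are not reproduced here; as a rough illustration of the idea, a SpecAugment-style time/frequency mask over a spectrogram (the same operation can be applied to the extracted feature maps) might look like the sketch below, with mask widths chosen arbitrarily:

```python
import torch

def mask_spectrogram(spec: torch.Tensor,
                     max_freq_width: int = 20,
                     max_time_width: int = 40) -> torch.Tensor:
    """Zero out one random frequency band and one random time span.

    `spec` has shape (freq_bins, time_frames); the width limits are
    illustrative defaults, not values from the paper.
    """
    spec = spec.clone()
    n_freq, n_time = spec.shape

    # Frequency mask: hide a random band of frequency bins.
    f = int(torch.randint(0, max_freq_width + 1, (1,)))
    f0 = int(torch.randint(0, max(1, n_freq - f), (1,)))
    spec[f0:f0 + f, :] = 0.0

    # Time mask: hide a random span of time frames.
    t = int(torch.randint(0, max_time_width + 1, (1,)))
    t0 = int(torch.randint(0, max(1, n_time - t), (1,)))
    spec[:, t0:t0 + t] = 0.0
    return spec
```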


Key findings
The proposed model achieved state-of-the-art results on ASVSpoof5 Track 1 under closed conditions, with an Equal Error Rate (EER) of 4.37% for a single model, improved to 3.39% by fusing models built on different pretrained feature extractors. Combining audio-level and feature-level augmentations with varied pretrained feature extractors significantly improved performance. The model also showed robust performance against unseen deepfake attacks and strong generalization across different codecs.
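
For reference, the Equal Error Rate is the operating point at which the false-acceptance rate (spoof accepted) equals the false-rejection rate (bona fide rejected). A small NumPy sketch, assuming the convention that higher scores mean "more likely bona fide":

```python
import numpy as np

def equal_error_rate(scores, labels) -> float:
    """EER of a binary detector. labels: 1 = bona fide, 0 = spoof."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)[np.argsort(scores)]
    n_pos = labels.sum()             # bona fide trials
    n_neg = labels.size - n_pos      # spoof trials
    # Sweeping the threshold upward, FRR rises while FAR falls;
    # the EER sits where the two curves cross.
    frr = np.cumsum(labels) / n_pos
    far = 1.0 - np.cumsum(1 - labels) / n_neg
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0)
```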
Approach
The authors propose a hybrid architecture that uses Wav2Vec 2.0 for self-supervised feature extraction and a ResNet34 for classification. They introduce novel augmentation strategies at both the raw audio and feature levels, including masking techniques and compression augmentation, to improve model robustness and generalization.
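
The compression augmentation could plausibly be approximated by round-tripping training audio through a lossy codec. The sketch below uses an in-memory MP3 round-trip via torchaudio as an assumed stand-in for the paper's actual codec set (it requires a torchaudio backend with MP3 support, e.g. ffmpeg):

```python
import io
import torch
import torchaudio

def compress_augment(waveform: torch.Tensor, sample_rate: int,
                     fmt: str = "mp3") -> torch.Tensor:
    """Encode to a lossy format and decode back, keeping codec artifacts.

    `waveform` has shape (channels, samples). MP3 is an assumed codec
    choice; the decoded length may differ slightly due to codec padding.
    """
    buf = io.BytesIO()
    torchaudio.save(buf, waveform, sample_rate, format=fmt)
    buf.seek(0)
    compressed, _ = torchaudio.load(buf, format=fmt)
    return compressed
```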
Datasets
ASVSpoof5 (ASVSpoof 2024) challenge dataset for training and evaluation; Multilingual LibriSpeech, CommonVoice, and Babel for pretraining the feature extractor.
Model(s)
Wav2Vec 2.0 (feature extractor), ResNet34 (classifier)
Author countries
Israel