Unmasking Deepfakes: Leveraging Augmentations and Features Variability for Deepfake Speech Detection
Authors: Inbal Rimon, Oren Gal, Haim Permuter
Published: 2025-01-09 19:31:10+00:00
AI Summary
This paper presents a hybrid deepfake speech detection model combining a self-supervised feature extractor (Wav2Vec 2.0) with a ResNet34 classifier head. The model incorporates novel audio-level and feature-level augmentations, achieving a state-of-the-art 4.37% equal error rate in Track 1 (closed condition) of the ASVSpoof5 challenge.
Abstract
The detection of deepfake speech has become increasingly challenging with the rapid evolution of deepfake technologies. In this paper, we propose a hybrid architecture for deepfake speech detection, combining a self-supervised learning framework for feature extraction with a classifier head to form an end-to-end model. Our approach incorporates both audio-level and feature-level augmentation techniques. Specifically, we introduce and analyze various masking strategies for augmenting raw audio spectrograms and for enhancing feature representations during training. We also incorporate compression augmentations during the pretraining phase of the feature extractor to address the limitations of small, single-language datasets. We evaluate the model on the ASVSpoof5 (ASVSpoof 2024) challenge, achieving state-of-the-art results in Track 1 under closed conditions with an Equal Error Rate (EER) of 4.37%. By employing different pretrained feature extractors, the model achieves an enhanced EER of 3.39%. Our model demonstrates robust performance against unseen deepfake attacks and exhibits strong generalization across different codecs.
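The abstract does not specify the exact masking strategies used, but a typical spectrogram-masking augmentation of the kind described (SpecAugment-style zeroing of random frequency bands and time spans) can be sketched as follows. The function name, mask widths, and mask counts here are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def mask_spectrogram(spec, n_freq_masks=1, n_time_masks=1,
                     max_freq_width=8, max_time_width=16, rng=None):
    """SpecAugment-style masking: zero out random frequency bands and
    time spans of a (freq_bins, time_steps) spectrogram.

    Illustrative sketch only; widths and counts are hypothetical, not
    the configuration used in the paper.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    # Mask random contiguous frequency bands.
    for _ in range(n_freq_masks):
        w = int(rng.integers(1, max_freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w + 1)))
        out[f0:f0 + w, :] = 0.0
    # Mask random contiguous time spans.
    for _ in range(n_time_masks):
        w = int(rng.integers(1, max_time_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - w + 1)))
        out[:, t0:t0 + w] = 0.0
    return out

# Example: augment a dummy 32-bin x 64-frame spectrogram.
spec = np.ones((32, 64))
augmented = mask_spectrogram(spec, rng=np.random.default_rng(0))
```

The same operation applied to intermediate embeddings rather than to the input spectrogram would correspond to the feature-level variant of the augmentation mentioned in the abstract.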