The Vicomtech Audio Deepfake Detection System based on Wav2Vec2 for the 2022 ADD Challenge

Authors: Juan M. Martín-Doñas, Aitor Álvarez

Published: 2022-03-03 08:49:17+00:00

AI Summary

This paper presents an audio deepfake detection system for the 2022 ADD challenge, combining a pre-trained wav2vec2 feature extractor with a downstream classifier. The system leverages contextualized speech representations from different transformer layers and data augmentation techniques to improve robustness and performance in various challenging audio conditions.

Abstract

This paper describes our submitted systems to the 2022 ADD challenge within tracks 1 and 2. Our approach is based on the combination of a pre-trained wav2vec2 feature extractor and a downstream classifier to detect spoofed audio. This method exploits the contextualized speech representations at the different transformer layers to fully capture discriminative information. Furthermore, the classification model is adapted to the application scenario using different data augmentation techniques. We evaluate our system for audio synthesis detection in both the ASVspoof 2021 and the 2022 ADD challenges, showing its robustness and good performance in realistic challenging environments such as telephonic and audio codec systems, noisy audio, and partial deepfakes.
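The abstract's core idea of exploiting representations from different transformer layers is commonly realized as a learnable weighted sum over per-layer hidden states. The sketch below illustrates that mechanism in numpy; the function and variable names are hypothetical and the paper's exact combination scheme may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def combine_layers(hidden_states, layer_logits):
    """Weighted sum of per-layer wav2vec2 representations.

    hidden_states: (num_layers, time, dim) transformer outputs.
    layer_logits:  (num_layers,) learnable scores, softmax-normalized,
                   so the classifier can emphasize the most
                   discriminative layers.
    """
    w = softmax(layer_logits)                      # (num_layers,)
    return np.tensordot(w, hidden_states, axes=1)  # (time, dim)

# Toy example: 4 layers, 10 frames, 8-dim features.
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 10, 8))
combined = combine_layers(H, np.zeros(4))  # zero logits -> uniform weights
```

With zero logits the weights are uniform, so the output equals the plain layer average; training would instead learn which layers carry the most spoofing cues.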


Key findings
The system achieved first and fourth place in tracks 1 and 2 of the 2022 ADD challenge, respectively. The use of data augmentation techniques, especially low-pass FIR filters, significantly improved performance. The system also demonstrated competitive results on the ASVspoof 2021 challenge, outperforming other methods in the speech deepfake track.
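The low-pass FIR filtering highlighted above can be reproduced with a simple windowed-sinc design, which band-limits the signal the way telephone channels do. This is a minimal numpy sketch of that augmentation idea, not the authors' implementation; the tap count and cutoff are illustrative choices.

```python
import numpy as np

def lowpass_fir(cutoff_hz, sr, num_taps=101):
    """Windowed-sinc low-pass FIR filter (Hamming window)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    fc = cutoff_hz / sr                      # normalized cutoff (cycles/sample)
    h = 2 * fc * np.sinc(2 * fc * n)         # ideal low-pass impulse response
    h *= np.hamming(num_taps)                # taper to reduce stopband ripple
    return h / h.sum()                       # unity gain at DC

def augment_lowpass(wav, sr, cutoff_hz):
    """Band-limit a waveform to simulate telephony-like channels."""
    h = lowpass_fir(cutoff_hz, sr)
    return np.convolve(wav, h, mode="same")

# Example: a 1 kHz + 7 kHz mixture at 16 kHz, filtered to telephone bandwidth.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 7000 * t)
y = augment_lowpass(x, sr, cutoff_hz=3400)   # 7 kHz tone is attenuated
```

Training on such band-limited copies exposes the classifier to the reduced bandwidth it later meets in telephonic test conditions.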
Approach
The system uses a pre-trained wav2vec2 model to extract features from different transformer layers. These features are then fed into a downstream classifier that uses temporal normalization, feed-forward layers, attentive statistical pooling, and a cosine similarity layer for final spoof detection. Data augmentation techniques further enhance the model's robustness.
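The pooling and scoring stages of the downstream classifier can be sketched as follows: attention weights collapse frame-level features into weighted mean and standard deviation statistics, and a cosine similarity against a class vector produces the final score. This numpy sketch only illustrates the two named components; all parameter names are hypothetical and the feed-forward and normalization layers are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_stat_pool(feats, w_att):
    """Attention-weighted mean and std over time -> fixed-size embedding.

    feats: (time, dim) frame-level features; w_att: (dim,) attention params.
    """
    scores = feats @ w_att                 # (time,) attention logits
    alpha = softmax(scores)                # attention weights over frames
    mean = alpha @ feats                   # weighted mean, (dim,)
    var = alpha @ (feats - mean) ** 2      # weighted variance, (dim,)
    std = np.sqrt(np.clip(var, 1e-9, None))
    return np.concatenate([mean, std])     # (2*dim,) utterance embedding

def cosine_score(embedding, class_weight):
    """Cosine similarity between the utterance embedding and a class vector."""
    return float(embedding @ class_weight /
                 (np.linalg.norm(embedding) * np.linalg.norm(class_weight)))

rng = np.random.default_rng(0)
F = rng.standard_normal((50, 16))              # 50 frames, 16-dim features
emb = attentive_stat_pool(F, rng.standard_normal(16))
score = cosine_score(emb, rng.standard_normal(32))
```

Pooling both mean and standard deviation keeps information about how variable the frames are, which helps flag partially spoofed utterances.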
Datasets
ASVspoof 2021 (LA and DF tracks), ADD 2022 (tracks 1 and 2), AISHELL-3, VCTK
Model(s)
wav2vec2 (XLSR-53 and XLS-R models) with a custom downstream classifier incorporating temporal normalization, feed-forward layers, attentive statistical pooling, and a cosine similarity layer.
Author countries
Spain