Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation

Authors: Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee-weon Jung, Junichi Yamagishi, Nicholas Evans

Published: 2022-02-24 17:55:00+00:00

Comment: Submitted to Speaker Odyssey Workshop 2022

AI Summary

This paper proposes a novel approach for automatic speaker verification spoofing and deepfake detection utilizing a fine-tuned wav2vec 2.0 self-supervised learning front-end. Combined with a new self-attentive aggregation layer and data augmentation, the method significantly improves generalization to unseen attacks. It achieves the lowest equal error rates reported in the literature for both the ASVspoof 2021 Logical Access and Deepfake databases.

Abstract

The performance of spoofing countermeasure systems depends fundamentally upon the use of sufficiently representative training data. With this usually being limited, current solutions typically lack generalisation to attacks encountered in the wild. Strategies to improve reliability in the face of uncontrolled, unpredictable attacks are hence needed. We report in this paper our efforts to use self-supervised learning in the form of a wav2vec 2.0 front-end with fine tuning. Despite initial base representations being learned using only bona fide data and no spoofed data, we obtain the lowest equal error rates reported in the literature for both the ASVspoof 2021 Logical Access and Deepfake databases. When combined with data augmentation, these results correspond to an improvement of almost 90% relative to our baseline system.

Key findings

The system achieved record-low equal error rates (EERs) of 0.82% for the ASVspoof 2021 LA database and 2.85% for the ASVspoof 2021 DF database. This represents up to a 90% relative reduction in EER for logical access spoofing and an 88% reduction for deepfake detection compared to the baseline. The self-supervised front-end combined with data augmentation consistently demonstrated improved generalization and domain robustness.
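The headline metric above is the equal error rate (EER): the operating point at which the false acceptance rate (spoofed audio accepted) equals the false rejection rate (bona fide audio rejected). A minimal sketch of how an EER can be estimated from countermeasure scores, assuming the common convention that higher scores mean bona fide; this threshold sweep is a generic illustration, not the ASVspoof evaluation toolkit:

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the equal error rate from two score arrays.

    Assumes higher scores indicate bona fide speech. Sweeps a threshold
    across every observed score and returns the point where the false
    rejection rate (FRR) and false acceptance rate (FAR) are closest.
    """
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bonafide_scores)),
                             np.zeros(len(spoof_scores))])
    order = np.argsort(scores)            # ascending threshold sweep
    labels = labels[order]
    n_bona = labels.sum()
    n_spoof = len(labels) - n_bona
    # FRR: fraction of bona fide trials at or below the threshold.
    frr = np.cumsum(labels) / n_bona
    # FAR: fraction of spoof trials above the threshold.
    far = 1.0 - np.cumsum(1.0 - labels) / n_spoof
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0
```

With perfectly separated scores the estimate is 0; reported EERs such as 0.82% come from the same kind of sweep over the full evaluation protocol.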
Approach

The approach replaces the traditional sinc-layer front-end with a pre-trained and fine-tuned wav2vec 2.0 XLS-R model for robust feature extraction. These features are then fed into an AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks) back-end, enhanced with a self-attention-based aggregation layer and on-the-fly data augmentation (RawBoost).
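The aggregation layer pools the front-end's frame-level features into a single utterance-level representation by letting the network weight informative frames more heavily. A minimal numpy sketch of one common additive self-attention pooling formulation; the parameter names `w`, `b`, `v` and the layer shape are illustrative assumptions, not the authors' exact design:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_attentive_pool(frames, w, b, v):
    """Self-attentive pooling over time.

    frames: (T, D) frame-level features from the front-end.
    w: (D, H), b: (H,), v: (H,) learned attention parameters (here: inputs).
    Returns a single (D,) utterance-level embedding as an attention-weighted
    mean of the frames.
    """
    scores = np.tanh(frames @ w + b) @ v   # (T,) unnormalised frame scores
    alpha = softmax(scores)                # (T,) attention weights, sum to 1
    return alpha @ frames                  # weighted mean over time
```

When the attention scores are uniform this reduces to plain temporal average pooling, which is the baseline behaviour the learned weights improve upon.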
Datasets

ASVspoof 2019 LA (training and validation), ASVspoof 2021 LA (evaluation), ASVspoof 2021 Deepfake (DF) (evaluation). The wav2vec 2.0 XLS-R model was pre-trained on the VoxPopuli (VP-400K), Multilingual LibriSpeech (MLS), Common Voice (CV), VoxLingua107 (VL), and BABEL (BBL) datasets.
Model(s)

wav2vec 2.0 XLS-R (0.3B) model (front-end), AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks) (back-end), RawNet2-based residual encoder, self-attention-based aggregation layer, Graph Attention Network (GAT), Heterogeneous Stacking Graph Attention Layer (HS-GAL).
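The RawBoost augmentation named above perturbs raw waveforms on the fly during training to improve robustness to channel and codec variation. A heavily simplified numpy sketch of the idea, adding noise at a random signal-to-noise ratio; real RawBoost uses coloured, convolutive, and impulsive noise, so every detail here is an illustrative assumption:

```python
import numpy as np

def add_noise_at_snr(x, rng, snr_db_range=(10.0, 40.0)):
    """Add white noise to waveform x at a random SNR drawn from a range.

    A simplified stand-in for one RawBoost-style component; the published
    method applies coloured additive, convolutive, and impulsive noise.
    """
    snr_db = rng.uniform(*snr_db_range)
    noise = rng.standard_normal(x.shape)
    p_sig = np.mean(x ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_sig / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return x + scale * noise
```

Applying such perturbations on the fly means each epoch sees a differently degraded copy of every utterance, which is what drives the generalisation gains reported above.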
Author countries

France, Japan, South Korea