Experimental Study: Enhancing Voice Spoofing Detection Models with wav2vec 2.0

Authors: Taein Kang, Soyul Han, Sunmook Choi, Jaejin Seo, Sanghyeok Chung, Seungeun Lee, Seungsang Oh, Il-Youp Kwak

Published: 2024-02-27 01:45:51+00:00

AI Summary

This research investigates using wav2vec 2.0 as an audio feature extractor for voice spoofing detection. By selectively choosing and fine-tuning Transformer layers within wav2vec 2.0, the authors achieve state-of-the-art performance on the ASVspoof 2019 LA dataset with various spoofing detection back-end models.

Abstract

Conventional spoofing detection systems have heavily relied on the use of handcrafted features derived from speech data. However, a notable shift has recently emerged towards the direct utilization of raw speech waveforms, as demonstrated by methods like SincNet filters. This shift underscores the demand for more sophisticated audio sample features. Moreover, the success of deep learning models, particularly those utilizing large pretrained wav2vec 2.0 as a featurization front-end, highlights the importance of refined feature encoders. In response, this research assessed the representational capability of wav2vec 2.0 as an audio feature extractor, modifying the size of its pretrained Transformer layers through two key adjustments: (1) selecting a subset of layers starting from the leftmost one and (2) fine-tuning a portion of the selected layers from the rightmost one. We complemented this analysis with five spoofing detection back-end models, with a primary focus on AASIST, enabling us to pinpoint the optimal configuration for the selection and fine-tuning process. In contrast to conventional handcrafted features, our investigation identified several spoofing detection systems that achieve state-of-the-art performance on the ASVspoof 2019 LA dataset. This comprehensive exploration offers valuable insights into feature selection strategies, advancing the field of spoofing detection.


Key findings
The proposed method achieves state-of-the-art performance on the ASVspoof 2019 LA dataset, with the wav2vec 2.0 + RawNet2 system reaching an EER of 0.12% and a minimum t-DCF of 0.0032. The best results depended on the specific numbers of Transformer layers selected and fine-tuned within wav2vec 2.0, showing that these two layer counts are important hyperparameters in their own right.
Approach
The authors use pretrained wav2vec 2.0 models as a front-end for feature extraction. They shrink and adapt the front-end in two steps: keeping only the first n Transformer layers (counted from the input side), and fine-tuning only the last k of those kept layers while the remainder stay frozen with their pretrained weights. The resulting features are then fed to several different back-end spoofing detection models.
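The layer selection and partial fine-tuning described above can be sketched in plain PyTorch. This is a minimal illustration, not the authors' code: generic `nn.TransformerEncoderLayer` modules stand in for the wav2vec 2.0 Transformer stack, and the class name `TruncatedEncoder` and the layer counts are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class TruncatedEncoder(nn.Module):
    """Stand-in for a wav2vec 2.0 Transformer stack with the paper's two knobs:

    n_select:   keep only the first n_select layers (counted from the input side);
                the deeper layers are discarded entirely.
    n_finetune: of the kept layers, unfreeze only the last n_finetune; the
                earlier ones stay frozen as fixed pretrained feature extractors.
    """
    def __init__(self, pretrained_layers, n_select, n_finetune):
        super().__init__()
        self.layers = nn.ModuleList(list(pretrained_layers)[:n_select])
        n_frozen = len(self.layers) - n_finetune
        for i, layer in enumerate(self.layers):
            trainable = i >= n_frozen  # only the rightmost n_finetune layers train
            for p in layer.parameters():
                p.requires_grad = trainable

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x  # features handed to a spoofing back-end (e.g. AASIST)

# Toy "pretrained" stack of 12 layers (wav2vec 2.0 Base has 12; XLS-R variants have 24 or 48).
full_stack = [
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    for _ in range(12)
]

# Keep the first 8 layers, fine-tune only the last 2 of them.
enc = TruncatedEncoder(full_stack, n_select=8, n_finetune=2)
features = enc(torch.randn(2, 50, 64))  # (batch, frames, dim)
```

In a real system the toy stack would be replaced by the encoder layers of a pretrained wav2vec 2.0 checkpoint, and the grid over `n_select` and `n_finetune` would be searched per back-end, which is the hyperparameter exploration the paper reports.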
Datasets
ASVspoof 2019 LA dataset
Model(s)
wav2vec 2.0 (various versions: XLSR-53, XLS-R(0.3B), XLS-R(1B), XLS-R(2B)), LCNN, Non-OFD, RawNet2, RawGAT-ST, AASIST
Author countries
Republic of Korea