Can large-scale vocoded spoofed data improve speech spoofing countermeasure with a self-supervised front end?

Authors: Xin Wang, Junichi Yamagishi

Published: 2023-09-12 07:25:08+00:00

AI Summary

This research explores using large-scale vocoded speech data to improve speech spoofing countermeasures (CMs). By continually training a self-supervised learning (SSL) model on over 9,000 hours of vocoded data, the authors demonstrate significant improvements in CM performance on various unseen test sets, surpassing previous state-of-the-art results.

Abstract

A speech spoofing countermeasure (CM) that discriminates between unseen spoofed and bona fide data requires diverse training data. While many datasets use spoofed data generated by speech synthesis systems, it was recently found that data vocoded by neural vocoders were also effective as the spoofed training data. Since many neural vocoders are fast in building and generation, this study used multiple neural vocoders and created more than 9,000 hours of vocoded data on the basis of the VoxCeleb2 corpus. This study investigates how this large-scale vocoded data can improve spoofing countermeasures that use data-hungry self-supervised learning (SSL) models. Experiments demonstrated that the overall CM performance on multiple test sets improved when using features extracted by an SSL model continually trained on the vocoded data. Further improvement was observed when using a new SSL distilled from the two SSLs before and after the continual training. The CM with the distilled SSL outperformed the previous best model on challenging unseen test sets, including the ASVspoof 2019 logical access, WaveFake, and In-the-Wild.


Key findings
Continually training an SSL model on the large-scale vocoded data improved CM performance. Using features from both the continually trained and the original pre-trained SSL models, or distilling the two into a single model, further enhanced results. The resulting CM outperformed previous best models on challenging unseen datasets.
Approach
The authors generated a large-scale dataset of vocoded speech (over 9,000 hours, based on VoxCeleb2) using multiple neural vocoders. They then used this data to continually train a self-supervised learning (SSL) model, either using the continually trained model directly as a feature extractor for the speech spoofing countermeasure, or distilling it together with the original pre-trained SSL model into a single new SSL model.
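The distillation step can be illustrated with a minimal sketch: a student model is trained so that its features match those produced by the two teacher SSLs (the model before and after continual training). This is an assumed L2 feature-matching formulation for illustration only; the paper's exact distillation loss and architecture may differ.

```python
import numpy as np


def distillation_loss(student_feats, teacher_feats_a, teacher_feats_b):
    """L2 feature-matching loss against two teacher SSL models.

    Each argument is a (T, D) array of frame-level features:
    T frames, D feature dimensions. The student is pushed toward
    the average of the two teachers' features (an assumption made
    for this sketch, not necessarily the paper's exact target).
    """
    target = 0.5 * (teacher_feats_a + teacher_feats_b)  # combine teachers
    return float(np.mean((student_feats - target) ** 2))


# Toy example: 4 frames of 8-dimensional features.
rng = np.random.default_rng(0)
student = rng.normal(size=(4, 8))
teacher_pretrained = rng.normal(size=(4, 8))   # SSL before continual training
teacher_continual = rng.normal(size=(4, 8))    # SSL after continual training
loss = distillation_loss(student, teacher_pretrained, teacher_continual)
```

Minimizing this loss over the student's parameters would yield a single model whose features approximate both teachers, which is then used as the front end of the countermeasure.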
Datasets
VoxCeleb2, ASVspoof 2019 logical access (LA), ASVspoof 2021 LA, ASVspoof 2021 DF, WaveFake, In-the-Wild
Model(s)
Wav2Vec 2.0 (XLSR-53, w2v), HiFi-GAN, WaveGlow, Hn-NSF, a fusion of Hn-NSF and HiFi-GAN
Author countries
Japan