Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders

Authors: Xin Wang, Junichi Yamagishi

Published: 2022-10-19 14:10:02+00:00

AI Summary

This paper proposes a method for efficiently creating spoofed training data for speech spoofing countermeasures using neural vocoders, instead of relying on computationally expensive TTS and VC systems. A contrastive feature loss is introduced to improve the training process by leveraging the relationship between bona fide and spoofed data pairs.

Abstract

A good training set for speech spoofing countermeasures requires diverse TTS and VC spoofing attacks, but generating TTS and VC spoofed trials for a target speaker may be technically demanding. Instead of using full-fledged TTS and VC systems, this study uses neural-network-based vocoders to do copy-synthesis on bona fide utterances. The output data can be used as spoofed data. To make better use of pairs of bona fide and spoofed data, this study introduces a contrastive feature loss that can be plugged into the standard training criterion. On the basis of the bona fide trials from the ASVspoof 2019 logical access training set, this study empirically compared a few training sets created in the proposed manner using a few neural non-autoregressive vocoders. Results on multiple test sets suggest good practices such as fine-tuning neural vocoders using bona fide data from the target domain. The results also demonstrated the effectiveness of the contrastive feature loss. Combining the best practices, the trained CM achieved overall competitive performance. Its EERs on the ASVspoof 2021 hidden subsets also outperformed the top-1 challenge submission.
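As a rough illustration of the copy-synthesis idea described above, here is a minimal sketch assuming a pretrained mel-to-waveform neural vocoder (such as the HiFi-GAN or Parallel WaveGAN models listed under Model(s)). The mel-spectrogram settings, the `load_pretrained_vocoder` helper, and the vocoder call signature are illustrative assumptions, not the authors' exact pipeline.

```python
# Sketch: create "spoofed" CM training data by vocoder copy-synthesis of bona fide audio.
import torch
import torchaudio

def mel_features(wav: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    """Log-mel spectrogram, a typical conditioning input for neural vocoders (settings assumed)."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80
    )(wav)
    return torch.log(mel.clamp(min=1e-5))

def copy_synthesis(vocoder: torch.nn.Module, wav_path: str, out_path: str, sr: int = 16000):
    """Analyse a bona fide utterance and re-synthesise it with a neural vocoder.
    The output keeps the content of the original but carries vocoder artefacts,
    so it can be labelled 'spoof' for countermeasure training."""
    wav, orig_sr = torchaudio.load(wav_path)
    if orig_sr != sr:
        wav = torchaudio.functional.resample(wav, orig_sr, sr)
    cond = mel_features(wav, sr)                  # (channels, n_mels, frames)
    with torch.no_grad():
        fake_wav = vocoder(cond)                  # assumed vocoder interface: mel -> waveform
    torchaudio.save(out_path, fake_wav.view(1, -1).cpu(), sr)

# Usage (hypothetical helper; the paper recommends vocoders fine-tuned on target-domain bona fide data):
# vocoder = load_pretrained_vocoder("hifigan_finetuned_on_target_domain.pt")
# copy_synthesis(vocoder, "bonafide_001.wav", "spoof_001.wav")
```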


Key findings
Fine-tuning the neural vocoders on bona fide data from the target domain and adding the contrastive feature loss both improved the countermeasure's performance. Combining these practices, the trained countermeasure achieved competitive EERs on multiple benchmark test sets and outperformed the top-1 challenge submission on the ASVspoof 2021 hidden subsets.
Approach
The authors use neural vocoders to perform copy-synthesis on bona fide utterances, generating spoofed data. A contrastive feature loss is incorporated into the training criterion to better utilize the pairs of bona fide and spoofed data.
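The exact formulation of the contrastive feature loss is not given in this summary, so the following is a minimal sketch of one plausible version: over a mini-batch that contains each bona fide utterance together with its vocoded copy, embeddings sharing a label are pulled together and embeddings with different labels are pushed apart. The margin, the Euclidean distance, and the cross-entropy base criterion are assumptions for illustration only.

```python
# Sketch of a contrastive feature loss on paired bona fide / vocoded embeddings (assumed form).
import torch
import torch.nn.functional as F

def contrastive_feature_loss(emb: torch.Tensor, labels: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """emb: (batch, dim) countermeasure embeddings; labels: (batch,) 1 = bona fide, 0 = spoof.
    Pulls together embeddings with the same label and pushes apart embeddings with
    different labels, e.g. a bona fide utterance and its vocoded copy."""
    dist = torch.cdist(emb, emb)                              # pairwise Euclidean distances
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    eye = torch.eye(len(labels), device=emb.device)
    pos = (same - eye).clamp(min=0)                           # same-label pairs, excluding self-pairs
    neg = 1.0 - same                                          # different-label pairs
    pos_loss = (pos * dist.pow(2)).sum() / pos.sum().clamp(min=1)
    neg_loss = (neg * F.relu(margin - dist).pow(2)).sum() / neg.sum().clamp(min=1)
    return pos_loss + neg_loss

# Plugged into the standard training criterion (cross-entropy assumed for illustration):
# loss = F.cross_entropy(logits, labels.long()) + lambda_cf * contrastive_feature_loss(emb, labels)
```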
Datasets
ASVspoof 2019 logical access training set (LA19trn), ASVspoof 2019 LA development set, ASVspoof 2019 LA test set (LA19eval), ASVspoof 2021 LA evaluation subsets (LA21eval), ASVspoof 2021 DF evaluation subsets (DF21eval), ASVspoof 2021 hidden subsets (LA21hid, DF21hid), WaveFake, In the Wild (InWild), LibriTTS
Model(s)
Wav2vec 2.0 (for feature extraction), a feedforward classifier, HiFiGAN, Parallel WaveGAN (PWG), MultiBand MelGAN, Harmonic-plus-noise neural source-filter model (Hn-NSF), combination of Hn-NSF and HiFiGAN (NSF-HiFiGAN), WaveGlow
Author countries
Japan