Light Convolutional Neural Network with Feature Genuinization for Detection of Synthetic Speech Attacks

Authors: Zhenzong Wu, Rohan Kumar Das, Jichen Yang, Haizhou Li

Published: 2020-09-21 06:38:19+00:00

AI Summary

This paper proposes a novel feature genuinization method for synthetic speech detection. A CNN-based transformer trained only on genuine speech is applied to the input features to widen the gap between genuine and synthetic speech before classification with a light CNN (LCNN). The approach outperforms state-of-the-art spoofing countermeasures on the ASVspoof 2019 logical access corpus.

Abstract

Modern text-to-speech (TTS) and voice conversion (VC) systems produce natural sounding speech that questions the security of automatic speaker verification (ASV). This makes detection of such synthetic speech very important to safeguard ASV systems from unauthorized access. Most of the existing spoofing countermeasures perform well when the nature of the attacks is made known to the system during training. However, their performance degrades in face of unseen nature of attacks. In comparison to the synthetic speech created by a wide range of TTS and VC methods, genuine speech has a more consistent distribution. We believe that the difference between the distribution of synthetic and genuine speech is an important discriminative feature between the two classes. In this regard, we propose a novel method referred to as feature genuinization that learns a transformer with convolutional neural network (CNN) using the characteristics of only genuine speech. We then use this genuinization transformer with a light CNN classifier. The ASVspoof 2019 logical access corpus is used to evaluate the proposed method. The studies show that the proposed feature genuinization based LCNN system outperforms other state-of-the-art spoofing countermeasures, depicting its effectiveness for detection of synthetic speech attacks.


Key findings
The proposed feature genuinization based LCNN system outperforms other state-of-the-art spoofing countermeasures, achieving a lower tandem detection cost function (t-DCF) and equal error rate (EER) on the evaluation set of the ASVspoof 2019 logical access corpus. A contrast experiment in which the model was trained on spoofed speech instead of genuine speech performed worse, which supports the choice of training on genuine speech only.
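As a reminder of what the EER figure measures, the following is a minimal sketch of the metric, assuming that a higher score means "more genuine". The function name `equal_error_rate` and the toy scores are illustrative and do not come from the paper, and the simple threshold sweep is only an approximation of the interpolated EER that evaluation toolkits typically report.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    Assumption: a higher score means the trial looks more genuine.
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: genuine trials scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance rate: spoofed trials scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # Keep the operating point where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy scores (made up for illustration): one genuine and one spoofed
# trial fall on the wrong side of the best threshold.
toy_genuine = [0.9, 0.8, 0.7, 0.4]
toy_spoof = [0.1, 0.2, 0.3, 0.6]
toy_eer = equal_error_rate(toy_genuine, toy_spoof)
```

With these toy scores, one in four trials of each class is misclassified at the balanced threshold, so the sketch yields an EER of 0.25. A lower EER means the two score distributions overlap less.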
Approach
The proposed approach uses a CNN-based transformer (the genuinization transformer) trained solely on genuine speech to transform both genuine and synthetic speech features. The transformed features are then fed into a light CNN classifier for spoofing detection. The transformer aims to amplify the differences between the genuine and synthetic speech distributions.
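The data flow described above can be illustrated with a toy sketch. It shows only the ordering of the steps: a transform is fitted on genuine features alone, applied to every input, and the result is passed to a scorer. Everything in it is an assumption made for illustration: `GenuineOnlyTransform` uses per-dimension standardization as a stand-in for the CNN-based genuinization transformer, `toy_score` is a distance-based stand-in for the light CNN, and the two-dimensional feature vectors are invented.

```python
from statistics import mean, pstdev


class GenuineOnlyTransform:
    """Hypothetical stand-in for the genuinization transformer.

    The paper uses a CNN here; this toy version only standardizes each
    feature dimension with statistics estimated from genuine data.
    """

    def fit(self, genuine_features):
        columns = list(zip(*genuine_features))
        self.means = [mean(c) for c in columns]
        # Fall back to 1.0 so a constant dimension does not divide by zero.
        self.stds = [pstdev(c) or 1.0 for c in columns]
        return self

    def transform(self, feature):
        return [(x - m) / s
                for x, m, s in zip(feature, self.means, self.stds)]


def toy_score(transformed):
    """Hypothetical stand-in for the light CNN: higher is 'more genuine'."""
    return -sum(x * x for x in transformed)


# Step 1: fit the transform on genuine speech features only (toy data).
genuine_train = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.0, 2.0]]
transformer = GenuineOnlyTransform().fit(genuine_train)

# Step 2: apply the same transform to genuine and synthetic inputs alike.
genuine_test = [1.1, 1.9]
spoof_test = [3.0, 0.5]

# Step 3: score the transformed features.
genuine_score = toy_score(transformer.transform(genuine_test))
spoof_score = toy_score(transformer.transform(spoof_test))
```

In this toy setting the genuine input stays close to the origin after the transform while the synthetic input lands far from it, so `genuine_score` is higher than `spoof_score`. The actual system learns both stages as neural networks, and the classifier is trained on transformed features from both classes rather than using a fixed distance rule.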
Datasets
ASVspoof 2019 logical access corpus
Model(s)
Convolutional Neural Network (CNN) based genuinization transformer and a light CNN classifier.
Author countries
Singapore, China