UR Channel-Robust Synthetic Speech Detection System for ASVspoof 2021

Authors: Xinhui Chen, You Zhang, Ge Zhu, Zhiyao Duan

Published: 2021-07-26 08:15:24+00:00

AI Summary

This paper presents a channel-robust synthetic speech detection system for the ASVspoof 2021 challenge. It uses an acoustic simulator to augment datasets with various codec and channel effects, and employs an ECAPA-TDNN model with one-class learning and channel-robust training strategies.

Abstract

In this paper, we present the UR-AIR system submission to the logical access (LA) and speech deepfake (DF) tracks of the ASVspoof 2021 Challenge. The LA and DF tasks focus on synthetic speech detection (SSD), i.e., detecting text-to-speech and voice conversion as spoofing attacks. Unlike previous ASVspoof challenges, this year's LA task presents codec and transmission channel variability, while the new DF task presents general audio compression. Building on our previous research on improving the robustness of SSD systems to channel effects, we propose a channel-robust synthetic speech detection system for the challenge. To mitigate the channel variability issue, we use an acoustic simulator to apply transmission codecs, compression codecs, and convolutional impulse responses to augment the original datasets. For the neural network backbone, we propose to use Emphasized Channel Attention, Propagation and Aggregation Time Delay Neural Networks (ECAPA-TDNN) as our primary model. We also incorporate one-class learning with channel-robust training strategies to further learn a channel-invariant speech representation. Our submission achieved an EER of 20.33% in the DF task, and an EER of 5.46% and a min-tDCF of 0.3094 in the LA task.


Key findings
The proposed system achieved an EER of 20.33% on the DF task and an EER of 5.46% and min-tDCF of 0.3094 on the LA task of the ASVspoof 2021 challenge. The results show significant improvement over baseline systems, particularly in robustness to various channel conditions.
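For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how it can be estimated from detection scores, assuming higher scores mean "more likely bona fide". The function name compute_eer and the toy scores are illustrative assumptions; this is not the official ASVspoof scoring code.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the EER and its threshold (higher score = more bona fide)."""
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    # False rejection rate: bona fide utterances scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed utterances scored at or above it.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Toy scores (hypothetical, for illustration only).
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With real systems the scores come from the detector's output on the evaluation set, and the error curves are usually interpolated rather than read off at discrete thresholds.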
Approach
The authors address channel variability in synthetic speech detection by augmenting training data using an acoustic simulator to introduce various codec and transmission channel effects. They utilize an ECAPA-TDNN neural network architecture, along with one-class learning and channel-robust training strategies for improved robustness and generalization.
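The augmentation idea can be illustrated with a small sketch. It assumes a mu-law companding round trip as a stand-in for a transmission codec and a short hand-written impulse response as a stand-in for a channel. The authors' acoustic simulator is not reproduced here, and the names mu_law_roundtrip, apply_impulse_response, and augment are hypothetical.

```python
import numpy as np

def mu_law_roundtrip(wave, mu=255, bits=8):
    """Compress, quantize, and expand a waveform in [-1, 1] (codec stand-in)."""
    wave = np.clip(wave, -1.0, 1.0)
    compressed = np.sign(wave) * np.log1p(mu * np.abs(wave)) / np.log1p(mu)
    levels = 2 ** bits - 1
    # Uniform quantization in the compressed domain.
    quantized = np.round((compressed + 1) / 2 * levels) / levels * 2 - 1
    # Inverse mu-law expansion back to the waveform domain.
    return np.sign(quantized) * np.expm1(np.abs(quantized) * np.log1p(mu)) / mu

def apply_impulse_response(wave, ir):
    """Convolve with an impulse response, keep the length, peak-normalize."""
    out = np.convolve(wave, ir, mode="full")[: len(wave)]
    peak = np.max(np.abs(out))
    return out / peak if peak > 0 else out

def augment(wave, ir, rng):
    """Randomly apply the codec stand-in, the channel stand-in, both, or neither."""
    if rng.random() < 0.5:
        wave = mu_law_roundtrip(wave)
    if rng.random() < 0.5:
        wave = apply_impulse_response(wave, ir)
    return wave

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = 0.5 * np.sin(2 * np.pi * 220.0 * t)   # 1 s synthetic tone at 16 kHz
toy_ir = np.array([1.0, 0.5, 0.25])           # hypothetical impulse response
augmented = augment(clean, toy_ir, rng)
```

Applying such transforms at random during training exposes the model to several channel conditions for the same utterance, which is the general motivation for this kind of augmentation.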
Datasets
ASVspoof 2019 training and development datasets, ASVspoof 2021 LA and DF evaluation datasets.
Model(s)
ECAPA-TDNN (multiple variants), ResNet
Author countries
USA