Learnable Spectro-temporal Receptive Fields for Robust Voice Type Discrimination

Authors: Tyler Vuong, Yangyang Xia, Richard Stern

Published: 2020-10-19 00:29:02+00:00

Comment: Accepted Interspeech 2020. Video: http://www.interspeech2020.org/index.php?m=content&c=index&a=show&catid=311&id=712

AI Summary

This paper proposes STRFNet, a deep-learning-based Voice Type Discrimination (VTD) system whose first layer consists of learnable spectro-temporal receptive fields (STRFs). The system performs strongly on a new VTD database and on the spoofing detection task of the ASVspoof 2019 challenge. The results indicate that learnable STRFs improve robustness across a range of noise conditions and that the system consistently outperforms competitive baselines.

Abstract

Voice Type Discrimination (VTD) refers to distinguishing regions in a recording where speech was produced by speakers who are physically in proximity to the recording device (Live Speech) from regions containing speech and other types of audio that were played back, such as traffic noise and television broadcasts (Distractor Audio). In this work, we propose a deep-learning-based VTD system that features an initial layer of learnable spectro-temporal receptive fields (STRFs). Our approach is also shown to provide very strong performance on a similar spoofing detection task in the ASVspoof 2019 challenge. We evaluate our approach on a new standardized VTD database that was collected to support research in this area. In particular, we study the effect of using learnable STRFs compared to static STRFs or unconstrained kernels. We also show that our system consistently improves on a competitive baseline system across a wide range of signal-to-noise ratios on spoofing detection in the presence of VTD distractor noise.


Key findings
The learnable STRFNet system consistently outperformed generic CNN baselines and static STRF implementations on the VTD and ASVspoof-LA tasks. On VTD, learnable STRFs gave significant relative reductions in detection cost function (DCF) and equal error rate (EER), which indicates their importance for robust performance. The Hybrid model (combining generic and STRF kernels) was the most robust for spoofing detection, suggesting that STRFs reject distractor noise effectively but may not suffice on their own to distinguish real from synthetic speech.
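For readers unfamiliar with the equal error rate (EER) metric named above, the following is a minimal sketch of how it can be approximated from detection scores with a simple threshold sweep. It is not the authors' evaluation script; the function name and the convention that higher scores mean "target" are assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumes (for illustration) that higher scores indicate the target class.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))

    # Miss rate: fraction of targets scored below the threshold.
    miss = np.array([(target_scores < th).mean() for th in thresholds])
    # False-alarm rate: fraction of non-targets scored at or above the threshold.
    false_alarm = np.array([(nontarget_scores >= th).mean() for th in thresholds])

    # The EER is the operating point where the two error rates are closest.
    i = np.argmin(np.abs(miss - false_alarm))
    return (miss[i] + false_alarm[i]) / 2
```

A lower EER means the two score distributions are better separated. Production evaluation tools usually interpolate between thresholds, which this sketch omits.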
Approach
STRFNet takes log-mel spectrograms as input. Its first convolutional layer re-parameterizes the kernels as learnable spectro-temporal receptive fields (STRFs) to extract speech-specific features. This layer is followed by residual convolutional blocks, a stacked bidirectional Gated Recurrent Unit (GRU) with a self-attention pooling layer for temporal modeling, and a Multi-Layer Perceptron (MLP) for final classification.
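To illustrate what re-parameterizing a convolution kernel as an STRF can look like, here is a minimal sketch that builds a Gabor-like kernel from two scalars and applies it to a log-mel spectrogram. This summary does not give the paper's exact parameterization, so the Hann envelope, the cosine carrier, the units of `rate` and `scale`, and the function names `gabor_strf` and `apply_strf` are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def gabor_strf(rate, scale, n_freq=9, n_time=9):
    """Build one real-valued Gabor-like kernel of shape (n_freq, n_time).

    rate  : temporal modulation, in cycles per frame (assumed unit)
    scale : spectral modulation, in cycles per mel bin (assumed unit)
    In a trainable layer, rate and scale would be the only learned values.
    """
    t = np.arange(n_time) - (n_time - 1) / 2
    f = np.arange(n_freq) - (n_freq - 1) / 2
    # Separable Hann envelope; the zero-valued endpoints are trimmed.
    env = np.outer(np.hanning(n_freq + 2)[1:-1], np.hanning(n_time + 2)[1:-1])
    # The sinusoidal carrier sets the kernel's spectro-temporal orientation.
    carrier = np.cos(2 * np.pi * (scale * f[:, None] + rate * t[None, :]))
    return env * carrier


def apply_strf(log_mel, kernel):
    """Valid-mode 2D cross-correlation of a (n_mels, n_frames) spectrogram."""
    windows = sliding_window_view(log_mel, kernel.shape)
    return np.einsum("ftij,ij->ft", windows, kernel)
```

Because each kernel is determined by two numbers rather than by every tap, a layer built this way has far fewer free parameters than an unconstrained kernel of the same size, which is the contrast the comparison with generic CNN kernels draws on.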
Datasets
A new standardized VTD database collected by SRI International for JHU/APL, and the ASVspoof 2019 challenge dataset (Logical Access task, denoted as ASVspoof-LA, modified with VTD distractor audio and downsampled).
Model(s)
STRFNet (custom deep learning system featuring learnable Spectro-Temporal Receptive Fields, residual convolutional blocks, stacked bidirectional GRUs, self-attention pooling, and MLP). Baseline models included CNNK (generic CNN), Hybrid (combining generic and STRF kernels), and STRFNetS (static STRF kernels).
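The self-attention pooling step listed above collapses a variable-length sequence of frame-level features into one fixed-size vector. The sketch below shows one common single-vector formulation. The exact form used in STRFNet is not stated in this summary, so the scoring vector `w` and the function name `attention_pool` are hypothetical.

```python
import numpy as np


def attention_pool(h, w):
    """Pool frame-level features into one utterance-level vector.

    h : (T, D) array of frame features, e.g. recurrent-layer outputs
    w : (D,) scoring vector (hypothetical; it would be learned in training)
    Returns the pooled (D,) vector and the (T,) attention weights.
    """
    scores = h @ w
    scores = scores - scores.max()  # subtract the max for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over time
    return alpha @ h, alpha
```

With `w` set to zeros the weights are uniform and the result reduces to mean pooling over time. A trained `w` lets the model emphasize the frames that are most informative for the decision.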
Author countries
USA