Learnable Spectro-temporal Receptive Fields for Robust Voice Type Discrimination


Authors: Tyler Vuong, Yangyang Xia, Richard Stern

Published: 2020-10-19 00:29:02+00:00

AI Summary

This paper proposes a deep-learning system for Voice Type Discrimination (VTD), which distinguishes live speech from playback audio. The system uses a learnable spectro-temporal receptive field (STRF) layer for robust feature extraction, showing strong performance on VTD and ASVspoof 2019 spoofing detection tasks.

Abstract

Voice Type Discrimination (VTD) refers to distinguishing regions in a recording where speech was produced by speakers who are physically near the recording device (Live Speech) from regions containing speech and other types of audio that were played back, such as traffic noise and television broadcasts (Distractor Audio). In this work, we propose a deep-learning-based VTD system that features an initial layer of learnable spectro-temporal receptive fields (STRFs). Our approach is also shown to provide very strong performance on a similar spoofing detection task in the ASVspoof 2019 challenge. We evaluate our approach on a new standardized VTD database that was collected to support research in this area. In particular, we study the effect of using learnable STRFs compared to static STRFs or unconstrained kernels. We also show that our system consistently improves on a competitive baseline system across a wide range of signal-to-noise ratios on spoofing detection in the presence of VTD distractor noise.

Key findings

The proposed learnable STRFNet consistently outperforms baselines on both VTD and ASVspoof 2019 tasks. Learnable STRFs are shown to be essential for robust spoofing detection in noisy environments and VTD. The Hybrid model, combining generic and STRF kernels, performed best and was most robust to noise and different synthesis methods.

Approach

The authors propose STRFNet, a deep learning model that incorporates a layer of learnable spectro-temporal receptive fields (STRFs) for robust feature extraction from audio. This is followed by convolutional layers with residual connections, bidirectional GRUs with self-attention, and an MLP for classification.
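To make the idea of a constrained, learnable kernel concrete, the sketch below builds one spectro-temporal kernel from just two parameters and applies it to a toy spectrogram. It is an illustration under stated assumptions, not the authors' implementation: this summary does not give the paper's STRF parameterization, so a Gabor-style form is substituted here, and the names `gabor_strf`, `conv2d_valid`, `rate_hz`, `scale_cyc_per_bin`, and `frame_rate_hz` are hypothetical. In a trained model the two parameters would be updated by gradient descent; here they are fixed.

```python
import math

def gabor_strf(rate_hz, scale_cyc_per_bin, n_time=9, n_freq=9, frame_rate_hz=100.0):
    """Build a 2-D Gabor-style kernel indexed [freq][time].

    ASSUMPTION: the Gabor form is a stand-in, not the paper's parameterization.
    rate_hz is the temporal modulation; scale_cyc_per_bin is the spectral
    modulation in cycles per frequency bin.
    """
    sigma_t = n_time / 4.0  # Gaussian envelope widths (fixed, hypothetical)
    sigma_f = n_freq / 4.0
    kernel = []
    for f in range(n_freq):
        df = f - (n_freq - 1) / 2.0  # offset from kernel centre, in bins
        row = []
        for t in range(n_time):
            dt = t - (n_time - 1) / 2.0  # offset from kernel centre, in frames
            envelope = math.exp(-0.5 * ((dt / sigma_t) ** 2 + (df / sigma_f) ** 2))
            phase = 2.0 * math.pi * (rate_hz * dt / frame_rate_hz + scale_cyc_per_bin * df)
            row.append(envelope * math.cos(phase))
        kernel.append(row)
    return kernel

def conv2d_valid(spec, kernel):
    """Valid-mode 2-D cross-correlation of spec [freq][time] with kernel."""
    kf, kt = len(kernel), len(kernel[0])
    out_f = len(spec) - kf + 1
    out_t = len(spec[0]) - kt + 1
    return [
        [
            sum(spec[i + a][j + b] * kernel[a][b] for a in range(kf) for b in range(kt))
            for j in range(out_t)
        ]
        for i in range(out_f)
    ]

# Toy usage: one kernel tuned to 8 Hz temporal and 0.1 cycles/bin spectral modulation.
kernel = gabor_strf(rate_hz=8.0, scale_cyc_per_bin=0.1)
spec = [[math.sin(0.3 * t + 0.2 * f) for t in range(20)] for f in range(16)]  # 16 bins x 20 frames
features = conv2d_valid(spec, kernel)  # 8 bins x 12 frames
```

In this sketch each 9 x 9 kernel is determined by 2 parameters, whereas an unconstrained kernel of the same size would have 81 free weights. That difference is the kind of constraint the comparison between learnable STRFs, static STRFs, and unconstrained kernels is about.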

Datasets

A new standardized VTD database collected by SRI International, and the ASVspoof 2019 challenge dataset (Logical Access task), with VTD distractor audio added and the audio downsampled to 11,025 Hz.
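As a small worked example of the 11,025 Hz target rate, the sketch below computes the rational resampling ratio and the resulting clip length. It assumes a 16 kHz source rate, which is the sampling rate of the ASVspoof 2019 audio but is not stated in this summary; the helper name `resampled_length` is hypothetical, and the actual resampling method used by the authors is not specified here.

```python
from fractions import Fraction

SOURCE_RATE_HZ = 16000   # ASSUMPTION: source rate, not stated in this summary
TARGET_RATE_HZ = 11025   # target rate given in the dataset description

# Reduced up/down factors a polyphase resampler would use.
ratio = Fraction(TARGET_RATE_HZ, SOURCE_RATE_HZ)
up, down = ratio.numerator, ratio.denominator

def resampled_length(n_samples):
    """Number of output samples after resampling, rounded up."""
    return -(-n_samples * up // down)  # ceiling division on integers

four_seconds = resampled_length(4 * SOURCE_RATE_HZ)
nyquist_hz = TARGET_RATE_HZ / 2  # highest representable frequency after resampling
```

With these assumptions the ratio reduces to 441/640, so a 4-second clip of 64,000 samples becomes 44,100 samples, and content above 5,512.5 Hz is no longer representable.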

Model(s)

Convolutional Neural Network (CNN) with learnable spectro-temporal receptive fields (STRFs), residual blocks, bidirectional gated recurrent units (GRUs) and self-attention.

Author countries

USA
