Using Multi-Resolution Feature Maps with Convolutional Neural Networks for Anti-Spoofing in ASV

Authors: Qiongqiong Wang, Kong Aik Lee, Takafumi Koshinaka

Published: 2020-08-20 10:00:03+00:00

AI Summary

This paper proposes a method for anti-spoofing in automatic speaker verification (ASV) that uses multi-resolution feature maps with convolutional neural networks (CNNs). By stacking spectrograms extracted with different window lengths, the method improves both time and frequency resolution, leading to more discriminative representations of audio segments.

Abstract

This paper presents a simple but effective method that uses multi-resolution feature maps with convolutional neural networks (CNNs) for anti-spoofing in automatic speaker verification (ASV). The central idea is to alleviate the problem that the feature maps commonly used in anti-spoofing networks are insufficient for building discriminative representations of audio segments, as they are often extracted with a single-length sliding window. The resulting trade-off between time and frequency resolution restricts the information available in a single spectrogram. The proposed method improves both frequency resolution and time resolution by stacking multiple spectrograms that are extracted using different window lengths. These are fed into a convolutional neural network as multiple channels, making it possible to extract more information from input signals while only marginally increasing computational cost. The efficiency of the proposed method has been confirmed on the ASVspoof 2019 database. We show that the use of the proposed multi-resolution inputs consistently outperforms score fusion across different CNN architectures. Moreover, the computational cost remains small.


Key findings
The multi-resolution input consistently outperforms single-resolution inputs and score fusion across different CNN architectures. Stacking 2-resolution feature maps reduced EER by approximately 21.5% and 37.0% (relative) for ResNet18 and SENet50, respectively, with minimal computational overhead. 3-resolution inputs yielded even better results, with relative reductions of 38.4% and 45.3% for ResNet18 and SENet50, respectively.
Approach
The approach improves the audio feature representation by stacking multiple spectrograms computed with different window lengths (18 ms, 25 ms, and 30 ms) as input channels to a CNN. This lets the network learn from high time-resolution and high frequency-resolution information simultaneously, improving accuracy over single-resolution spectrograms.
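A minimal sketch of this channel-stacking idea, assuming the 18/25/30 ms window lengths from the summary, a shared 10 ms hop, and a fixed FFT size (zero-padding shorter windows) so all resolutions yield the same number of frequency bins; the hop size, FFT size, and helper names are illustrative choices, not details from the paper:

```python
import numpy as np

def log_spectrogram(signal, sr, win_ms, n_fft=512, hop_ms=10):
    """Log-magnitude STFT with a Hann window of win_ms milliseconds.
    The window is zero-padded to n_fft so every resolution produces
    the same number of frequency bins; a shared hop aligns frames."""
    win_len = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hanning(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * window
        spec = np.fft.rfft(frame, n=n_fft)   # zero-pad to n_fft
        frames.append(np.abs(spec))
    mag = np.array(frames).T                 # (n_fft//2 + 1, n_frames)
    return 20 * np.log10(mag + 1e-10)

def multi_resolution_input(signal, sr, win_lengths_ms=(18, 25, 30)):
    """Stack spectrograms of different window lengths as CNN channels.
    Shorter windows favour time resolution, longer windows frequency
    resolution; the channel axis lets the CNN exploit both."""
    specs = [log_spectrogram(signal, sr, w) for w in win_lengths_ms]
    n_frames = min(s.shape[1] for s in specs)  # trim to a common length
    return np.stack([s[:, :n_frames] for s in specs], axis=0)

sr = 16000
x = np.random.randn(sr)                      # 1 s of noise as a stand-in
feat = multi_resolution_input(x, sr)
print(feat.shape)                            # (channels, freq_bins, frames)
```

The resulting 3-channel tensor can be fed to a standard image-style CNN (e.g., ResNet18) in place of a single-channel spectrogram, which is why the added computational cost is marginal: only the first convolutional layer grows.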
Datasets
ASVspoof 2019 Physical Access (PA) subset
Model(s)
ResNet18, SENet50, Light CNN
Author countries
Japan