Replay and Synthetic Speech Detection with Res2net Architecture

Authors: Xu Li, Na Li, Chao Weng, Xunying Liu, Dan Su, Dong Yu, Helen Meng

Published: 2020-10-28 14:33:42+00:00

Comment: Accepted to ICASSP2021

AI Summary

This work proposes leveraging the Res2Net architecture for replay and synthetic speech detection to improve generalizability to unseen spoofing attacks. Res2Net modifies the ResNet block to enable multiple feature scales, which significantly enhances the anti-spoofing countermeasure's performance and reduces model size. Experimental results on the ASVspoof 2019 corpus demonstrate that Res2Net consistently outperforms ResNet34 and ResNet50, particularly when integrated with the Squeeze-and-Excitation (SE) block and using Constant-Q Transform (CQT) acoustic features.

Abstract

Existing approaches for replay and synthetic speech detection still lack generalizability to unseen spoofing attacks. This work proposes to leverage a novel model structure, Res2Net, to improve the anti-spoofing countermeasure's generalizability. Res2Net mainly modifies the ResNet block to enable multiple feature scales. Specifically, it splits the feature maps within one block into multiple channel groups and designs a residual-like connection across different channel groups. Such a connection increases the range of possible receptive fields, resulting in multiple feature scales. This multiple-scaling mechanism significantly improves the countermeasure's generalizability to unseen spoofing attacks. It also decreases the model size compared to ResNet-based models. Experimental results show that the Res2Net model consistently outperforms ResNet34 and ResNet50 by a large margin in both the physical access (PA) and logical access (LA) scenarios of the ASVspoof 2019 corpus. Moreover, integration with the squeeze-and-excitation (SE) block can further enhance performance. For feature engineering, we investigate the generalizability of Res2Net combined with different acoustic features, and observe that the constant-Q transform (CQT) achieves the most promising performance in both PA and LA scenarios. Our best single system outperforms other state-of-the-art single systems in both PA and LA of the ASVspoof 2019 corpus.


Key findings
The Res2Net model significantly outperforms ResNet34 and ResNet50 in both PA and LA scenarios of the ASVspoof 2019 corpus, achieving substantial EER reductions while also decreasing model size. Integration with the Squeeze-and-Excitation (SE) block further enhances performance. The Constant-Q Transform (CQT) acoustic feature yields the most promising results, enabling the best single system to outperform other state-of-the-art single systems.
Approach
The authors propose using the Res2Net architecture, which enhances ResNet blocks by splitting feature maps into multiple channel groups and applying residual-like connections across them to enable multiple feature scales. This mechanism improves the model's capacity and generalizability. Additionally, they integrate the Squeeze-and-Excitation (SE) block and investigate various acoustic features, finding the Constant-Q Transform (CQT) to be most effective.
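The channel-split-and-cascade idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the learned 3x3 convolution in each Res2Net branch is replaced here by a placeholder `conv` transform, and the group count `scales` is a hypothetical parameter chosen for illustration. The pattern matches the description above: the first channel group passes through unchanged, and each subsequent group is transformed after adding the previous group's output, which widens the receptive field group by group.

```python
import numpy as np

def res2net_split(x, scales=4, conv=lambda g: np.maximum(g, 0.0)):
    """Res2Net-style multi-scale processing of one feature map.

    x: feature map of shape (channels, height, width); channels must be
    divisible by `scales`. `conv` stands in for a learned 3x3 convolution
    (here a simple ReLU-like placeholder -- an assumption for illustration).
    """
    groups = np.split(x, scales, axis=0)   # split channels into `scales` groups
    outputs = [groups[0]]                  # first group passes through unchanged
    prev = None
    for g in groups[1:]:
        # residual-like connection: add the previous group's output, then transform
        inp = g if prev is None else g + prev
        prev = conv(inp)
        outputs.append(prev)
    # concatenated output has the same shape as the input feature map
    return np.concatenate(outputs, axis=0)
```

Because each group sees the accumulated outputs of the groups before it, the effective receptive field grows across groups within a single block, which is the multi-scale property the paper credits for the improved generalizability.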
Datasets
ASVspoof 2019 corpus (Physical Access - PA, Logical Access - LA)
Model(s)
Res2Net, SE-Res2Net, ResNet34, ResNet50
Author countries
China