Two-Path GMM-ResNet and GMM-SENet for ASV Spoofing Detection

Authors: Zhenchun Lei, Hui Yan, Changhong Liu, Minglei Ma, Yingen Yang

Published: 2024-07-08 04:42:36+00:00

AI Summary

This paper proposes two-path GMM-ResNet and GMM-SENet models for audio spoofing detection. These models leverage Gaussian probability features from two GMMs (one for genuine and one for spoofed speech) and utilize ResNet and SENet architectures to capture both score distribution on GMM components and inter-frame relationships, achieving significant performance improvements over the baseline GMM.

Abstract

Automatic speaker verification (ASV) systems can be vulnerable to various spoofing attacks. A 2-class Gaussian Mixture Model (GMM) classifier for genuine and spoofed speech is usually used as the baseline for spoofing detection. However, the GMM classifier does not separately consider the scores of feature frames on each Gaussian component. In addition, the GMM accumulates the scores on all frames independently and does not consider their correlations. We propose the two-path GMM-ResNet and GMM-SENet models for spoofing detection, whose input is the Gaussian probability features based on two GMMs trained on genuine and spoofed speech respectively. The models consider not only the score distribution on GMM components, but also the relationship between adjacent frames. A two-step training scheme is applied to improve system robustness. Experiments on the ASVspoof 2019 database show that, compared with the GMM baseline, the LFCC+GMM-ResNet system achieves relative reductions in min-tDCF and EER of 76.1% and 76.3% on the logical access scenario, and the LFCC+GMM-SENet system achieves relative reductions of 94.4% and 95.4% on the physical access scenario. After score fusion, the systems give the second-best results on both scenarios.
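The baseline criticised in the abstract can be illustrated with a small sketch. It assumes diagonal-covariance GMMs stored as (weights, means, variances) tuples and toy two-dimensional frames; the function names and all numbers are hypothetical and are not taken from the paper. The point it shows is that each frame is collapsed to a single log-likelihood per GMM, and the frames are then averaged independently.

```python
import math

def component_log_scores(frame, weights, means, variances):
    """Per-component log(w_k * N(frame; mu_k, diag(var_k))) for one frame."""
    scores = []
    for w, mu, var in zip(weights, means, variances):
        log_pdf = 0.0
        for x, m, v in zip(frame, mu, var):
            log_pdf += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        scores.append(math.log(w) + log_pdf)
    return scores

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def gmm_llr(frames, genuine_gmm, spoof_gmm):
    """Average per-frame log-likelihood ratio; positive favours genuine speech."""
    total = 0.0
    for frame in frames:
        # Component scores are summed away here, which is the information
        # the abstract says the baseline does not consider separately.
        total += logsumexp(component_log_scores(frame, *genuine_gmm))
        total -= logsumexp(component_log_scores(frame, *spoof_gmm))
    return total / len(frames)

# Toy GMMs (weights, means, variances); values are illustrative assumptions.
genuine_gmm = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
spoof_gmm = ([0.5, 0.5], [[4.0, 4.0], [5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]])
frames = [[0.1, -0.2], [0.9, 1.1], [0.4, 0.6]]
score = gmm_llr(frames, genuine_gmm, spoof_gmm)
```

Because the score is a plain average, shuffling the frames leaves it unchanged, which is why such a baseline cannot capture the relationship between adjacent frames.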


Key findings
The proposed GMM-ResNet and GMM-SENet models significantly outperform the baseline GMM on ASVspoof 2019. The two-path architecture and the two-step training scheme further improve performance. After score fusion, the systems give the second-best results on both the logical access and physical access scenarios.
Approach
The authors propose using Gaussian probability features extracted from two GMMs (trained on genuine and spoofed speech) as input to ResNet and SENet architectures. A two-step training scheme is employed to improve robustness. The two-path architecture processes features from both GMMs and concatenates the results before final classification.
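A minimal sketch of the two-path input described above, under stated assumptions: each frame is mapped to one log(w_k * p_k(x)) value per Gaussian component, giving a frames-by-components matrix for each GMM. The diagonal covariances, function names, and toy values are hypothetical, and any normalisation the paper applies to these features is not covered by this summary and is omitted here.

```python
import math

def gaussian_probability_features(frames, weights, means, variances):
    """Return one row per frame and one column per GMM component."""
    features = []
    for frame in frames:
        row = []
        for w, mu, var in zip(weights, means, variances):
            # Log-density of a diagonal-covariance Gaussian at this frame.
            log_pdf = sum(
                -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
                for x, m, v in zip(frame, mu, var)
            )
            row.append(math.log(w) + log_pdf)
        features.append(row)
    return features

def two_path_inputs(frames, genuine_gmm, spoof_gmm):
    """Build the two feature maps, one per GMM, that feed the two paths."""
    return (
        gaussian_probability_features(frames, *genuine_gmm),
        gaussian_probability_features(frames, *spoof_gmm),
    )

# Toy GMMs (weights, means, variances); values are illustrative assumptions.
genuine_gmm = (
    [0.2, 0.3, 0.5],
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0], [0.5, 0.5]],
)
spoof_gmm = ([0.5, 0.5], [[4.0, 4.0], [5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]])
frames = [[0.1, -0.2], [0.9, 1.1], [0.4, 0.6], [2.1, 0.1]]

genuine_map, spoof_map = two_path_inputs(frames, genuine_gmm, spoof_gmm)
```

Each map keeps the time axis (rows) and the component axis (columns) intact, so a convolutional network such as a ResNet or SENet can look at both the score distribution across components and the pattern across adjacent frames. The network itself is left out of this sketch.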
Datasets
ASVspoof 2019 database (Logical Access and Physical Access scenarios)
Model(s)
GMM-ResNet, GMM-SENet
Author countries
China