GMM-ResNet2: Ensemble of Group ResNet Networks for Synthetic Speech Detection

Authors: Zhenchun Lei, Hui Yan, Changhong Liu, Yong Zhou, Minglei Ma

Published: 2024-07-02 11:25:42+00:00

AI Summary

This paper proposes GMM-ResNet2, an improved model for synthetic speech detection. It extends the earlier GMM-ResNet model in four ways: multi-scale Log Gaussian Probability (LGP) features, a grouping technique for ensemble learning, an improved residual block, and an ensemble-aware loss function. The paper reports lower minimum t-DCF and EER on the ASVspoof 2019 and 2021 datasets.

Abstract

Deep learning models are widely used for speaker recognition and spoofed speech detection. We propose GMM-ResNet2 for synthetic speech detection. Compared with the previous GMM-ResNet model, GMM-ResNet2 has four improvements. First, GMMs of different orders differ in their ability to form smooth approximations to the feature distribution, so multiple GMMs are used to extract multi-scale Log Gaussian Probability features. Second, a grouping technique is used to improve classification accuracy by exposing the group cardinality while reducing both the number of parameters and the training time. The final score is obtained by averaging the outputs of all group classifiers. Third, the residual block is improved by including one activation function and one batch normalization layer. Finally, an ensemble-aware loss function is proposed to integrate the independent loss functions of all ensemble members. On the ASVspoof 2019 LA task, GMM-ResNet2 achieves a minimum t-DCF of 0.0227 and an EER of 0.79%. On the ASVspoof 2021 LA task, GMM-ResNet2 achieves a minimum t-DCF of 0.2362 and an EER of 2.19%, which are relative reductions of 31.4% and 76.3%, respectively, compared with the LFCC-LCNN baseline.
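The relative reductions quoted in the abstract can be checked with a little arithmetic. The sketch below assumes the usual definition, (baseline - new) / baseline, and back-computes the baseline values that the stated percentages imply. The function names are illustrative, and because the inputs are rounded figures the implied baselines are only approximate.

```python
def relative_reduction(baseline, value):
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


def implied_baseline(value, reduction):
    """Baseline that would yield `value` after a given relative reduction."""
    return value / (1.0 - reduction)


# Figures quoted for ASVspoof 2021 LA: min t-DCF 0.2362 (31.4% lower)
# and EER 2.19% (76.3% lower) than the LFCC-LCNN baseline.
tdcf_baseline = implied_baseline(0.2362, 0.314)  # about 0.344
eer_baseline = implied_baseline(2.19, 0.763)     # about 9.24 (percent)

print(f"implied baseline min t-DCF: {tdcf_baseline:.4f}")
print(f"implied baseline EER: {eer_baseline:.2f}%")
```

The two helpers are inverses of each other, so feeding an implied baseline back into `relative_reduction` recovers the quoted percentage.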

Key findings

GMM-ResNet2 achieved state-of-the-art performance on ASVspoof 2019 LA (minimum t-DCF 0.0227, EER 0.79%) and competitive results on the ASVspoof 2021 LA and DF tasks. Ablation studies showed the importance of the ensemble-aware loss function and the improved residual block. On ASVspoof 2021 LA, minimum t-DCF and EER fell by 31.4% and 76.3%, respectively, relative to the LFCC-LCNN baseline.

Approach

GMM-ResNet2 uses multiple GMMs of different orders to extract multi-scale Log Gaussian Probability features from LFCCs. These features are split into groups, a separate ResNet model processes each group, and the group outputs are averaged for the final prediction. During training, an ensemble-aware loss function integrates the individual loss functions of the ensemble members.
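As a rough illustration of the feature extraction step, the sketch below computes one Log Gaussian Probability (LGP) map per GMM. Several things here are assumptions and do not come from the paper: each LGP value is taken to be the log density of a single diagonal-covariance component (mixture weights and any normalization are omitted), the GMM orders 8, 16 and 32 are hypothetical, and the GMM parameters are random placeholders rather than models trained on LFCCs.

```python
import numpy as np


def log_gaussian_probs(frames, means, variances):
    """Per-component log densities for diagonal-covariance Gaussians.

    frames: (T, D) frame-level features; means, variances: (K, D).
    Returns an array of shape (T, K).
    """
    diff = frames[:, None, :] - means[None, :, :]              # (T, K, D)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)  # (T, K)
    log_det = np.sum(np.log(variances), axis=1)                # (K,)
    dim = frames.shape[1]
    return -0.5 * (dim * np.log(2.0 * np.pi) + log_det[None, :] + mahal)


def multi_scale_lgp(frames, gmms):
    """One LGP map per GMM; `gmms` is a list of (means, variances) pairs."""
    return [log_gaussian_probs(frames, m, v) for m, v in gmms]


rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 20))  # 50 frames of placeholder 20-dim features
orders = [8, 16, 32]                # hypothetical GMM orders
gmms = [(rng.normal(size=(k, 20)), np.ones((k, 20))) for k in orders]

features = multi_scale_lgp(frames, gmms)
shapes = [f.shape for f in features]  # one (frames, components) map per order
```

Each GMM order yields a feature map with a different number of columns, which is what makes the representation multi-scale in this sketch.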

Datasets

ASVspoof 2019 LA task, ASVspoof 2021 LA task, ASVspoof 2021 DF task

Model(s)

GMM-ResNet2 (an ensemble of grouped ResNet networks)
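The ensemble side of the model can be sketched in a few lines. Score averaging follows the description in the abstract, but the loss below is only one plausible reading of "ensemble-aware": it adds the cross-entropy of the averaged output to the mean of the per-group cross-entropies. That formulation, the `member_weight` parameter, and all function names are assumptions made for illustration, and the group logits are placeholders standing in for ResNet outputs.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def cross_entropy(logits, labels):
    """Mean cross-entropy for logits (N, C) and integer labels (N,)."""
    logp = log_softmax(logits)
    return float(-logp[np.arange(len(labels)), labels].mean())


def ensemble_scores(group_logits):
    """Average the outputs of all group classifiers; input is (G, N, C)."""
    return group_logits.mean(axis=0)


def ensemble_aware_loss(group_logits, labels, member_weight=1.0):
    """Assumed form: loss of the averaged output plus mean member loss."""
    member = np.mean([cross_entropy(g, labels) for g in group_logits])
    joint = cross_entropy(ensemble_scores(group_logits), labels)
    return joint + member_weight * member


# Placeholder outputs: 4 groups, 3 utterances, 2 classes (bona fide / spoof).
group_logits = np.zeros((4, 3, 2))
labels = np.array([0, 1, 0])

scores = ensemble_scores(group_logits)            # shape (3, 2)
loss = ensemble_aware_loss(group_logits, labels)
```

With all-zero logits every classifier is maximally uncertain, so both terms equal log 2 and the placeholder loss is 2 log 2.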

Author countries

China
