A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection

Authors: Xin Wang, Junichi Yamagishi

Published: 2021-03-21 07:39:20+00:00

AI Summary

This paper presents a comparative study of neural network-based speech spoofing countermeasures, focusing on varied-length input handling and loss functions. The authors found that average pooling for varied-length inputs and a new hyper-parameter-free loss function yielded a best-performing single model with an equal error rate (EER) of 1.92% on the ASVspoof 2019 logical access task.

Abstract

A great deal of recent research effort on speech spoofing countermeasures has been invested into back-end neural networks and training criteria. We contribute to this effort with a comparative perspective in this study. Our comparison of countermeasure models on the ASVspoof 2019 logical access task takes into account recently proposed margin-based training criteria, widely used front ends, and common strategies to deal with varied-length input trials. We also measured intra-model differences through multiple training-evaluation rounds with random initialization. Our statistical analysis demonstrates that the performance of the same model may be significantly different when just changing the random initial seed. Thus, we recommend similar analysis or multiple training-evaluation rounds for further research on the database. Despite the intra-model differences, we observed a few promising techniques such as average pooling to process varied-length inputs and a new hyper-parameter-free loss function. The two techniques led to the best single model in our experiment, which achieved an equal error rate of 1.92% and was significantly different in a statistical sense from most of the other experimental models.
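The equal error rate (EER) quoted above is the operating point where the false-acceptance rate (spoof accepted as bona fide) equals the false-rejection rate (bona fide rejected). As a minimal illustration, not the official ASVspoof scoring tooling, the metric can be computed from countermeasure scores with a simple threshold sweep:

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate: threshold where the false-acceptance rate
    (spoof scored above threshold) meets the false-rejection rate
    (bona fide scored below threshold). Higher score = more bona fide."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    best_eer, best_gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(spoof_scores >= t)    # spoof accepted as bona fide
        frr = np.mean(bonafide_scores < t)  # bona fide rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

bonafide = np.array([0.9, 0.8, 0.7, 0.6])
spoof = np.array([0.1, 0.2, 0.3, 0.95])
print(compute_eer(bonafide, spoof))  # one spoof trial overlaps -> EER 0.25
```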


Key findings
The performance of the same model can vary significantly depending on random initialization. Average pooling effectively handles varied-length speech inputs. A new hyper-parameter-free loss function based on P2SGrad achieved comparable or better performance than margin-based softmax losses, resulting in a best single model EER of 1.92%.
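The average-pooling idea is that frame-level features are simply averaged over the time axis, so utterances of any duration map to a fixed-size embedding before the classifier. A minimal NumPy sketch (feature dimension 64 is an arbitrary illustrative choice, not a value from the paper):

```python
import numpy as np

def average_pool(frame_features):
    """Collapse a (num_frames, feat_dim) frame sequence into a fixed
    (feat_dim,) utterance embedding by averaging over time."""
    return frame_features.mean(axis=0)

rng = np.random.default_rng(0)
short_utt = rng.normal(size=(120, 64))  # short trial: 120 frames
long_utt = rng.normal(size=(700, 64))   # long trial: 700 frames

# Both trials yield the same embedding size regardless of length.
assert average_pool(short_utt).shape == (64,)
assert average_pool(long_utt).shape == (64,)
```

This is what makes the strategy attractive for varied-length trials: no truncation or padding to a fixed duration is needed, since the pooled statistic is length-invariant by construction.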
Approach
The authors compared various combinations of front-ends (LFCC, LFB, spectrogram), network architectures (LCNN, LCNN with attention, LCNN with LSTM), and loss functions (cross-entropy with sigmoid, AM-softmax, OC-softmax, and a new hyper-parameter-free loss based on P2SGrad). They evaluated the performance on the ASVspoof 2019 logical access task, conducting multiple training runs to assess intra-model variability.
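The hyper-parameter-free loss builds on P2SGrad, which, roughly speaking, drives the cosine similarity between an utterance embedding and each class vector toward its one-hot target with an MSE-like gradient, removing the scale and margin hyper-parameters that AM-softmax and OC-softmax require. The following is a simplified sketch of that idea (an MSE on cosine scores), not the authors' exact formulation:

```python
import numpy as np

def cosine_scores(embeddings, class_weights):
    """Cosine similarity between each embedding (row) and each class
    weight vector (row): returns a (num_trials, num_classes) matrix."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    return e @ w.T

def mse_on_cosine_loss(cos, labels):
    """Mean squared error between cosine scores and one-hot targets.
    No margin or scale factor to tune, in the spirit of P2SGrad."""
    targets = np.eye(cos.shape[1])[labels]
    return np.mean((cos - targets) ** 2)

# Toy 2-class example: embeddings perfectly aligned with their class vectors.
class_w = np.eye(2)                       # row 0: bona fide, row 1: spoof
embs = np.array([[5.0, 0.0], [0.0, 3.0]])
cos = cosine_scores(embs, class_w)
print(mse_on_cosine_loss(cos, np.array([0, 1])))  # perfect alignment -> 0.0
```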
Datasets
ASVspoof 2019 logical access (LA) database
Model(s)
Light Convolutional Neural Network (LCNN), LCNN with attention pooling, LCNN with Bi-LSTM layers
Author countries
Japan