Self-supervised pre-training with acoustic configurations for replay spoofing detection

Authors: Hye-jin Shim, Hee-Soo Heo, Jee-weon Jung, Ha-Jin Yu

Published: 2019-10-22 05:54:41+00:00

AI Summary

This paper proposes a self-supervised pre-training framework based on acoustic configurations to improve replay spoofing detection. It leverages datasets published for other tasks (such as speaker verification) to train a deep neural network to decide whether two audio segments share an identical acoustic configuration, improving generalization to unseen conditions.

Abstract

Constructing a dataset for replay spoofing detection requires a physical process of playing an utterance and re-recording it, presenting a challenge to the collection of large-scale datasets. In this study, we propose a self-supervised framework for pre-training acoustic configurations using datasets published for other tasks, such as speaker verification. Here, acoustic configurations refer to the environmental factors generated during the process of voice recording, but not the voice itself, including microphone types, recording places, and ambient noise levels. Specifically, we select pairs of segments from utterances and train deep neural networks to determine whether the acoustic configurations of the two segments are identical. We validate the effectiveness of the proposed method on the ASVspoof 2019 physical access dataset using two well-performing systems. The experimental results demonstrate that the proposed method outperforms the baseline approach by 30%.


Key findings

The proposed self-supervised pre-training method improved the equal error rate (EER) by 30% relative to the baseline. Performance improved further with more pre-training data and with the LCNN architecture. Freezing pre-trained layers during fine-tuning was detrimental.
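For reference, the EER reported above is the operating point where the false rejection rate (FRR) equals the false acceptance rate (FAR). A minimal sketch of computing it from detection scores, with hypothetical scores and labels (not the paper's data):

```python
def compute_eer(scores, labels):
    """Approximate EER by scanning candidate thresholds.

    scores: detection scores (higher = more likely bona fide)
    labels: 1 = bona fide, 0 = replay/spoof
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)):
        # False rejection: bona fide trials scored below the threshold
        frr = sum(1 for s, l in zip(scores, labels) if l == 1 and s < t) / n_pos
        # False acceptance: spoof trials scored at/above the threshold
        far = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= t) / n_neg
        # EER is taken where the two rates cross (smallest gap)
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer
```

With perfectly separated scores, e.g. `compute_eer([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])`, the EER is 0.0; in practice one would use an interpolated ROC-based implementation.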
Approach

The approach pre-trains a deep neural network with a self-supervised objective on acoustic configurations (the environmental factors of a recording, such as microphone type, recording place, and ambient noise). The network learns to decide, via cosine similarity between segment embeddings, whether two audio segments share the same acoustic configuration. The pre-trained network is then fine-tuned for replay spoofing detection on a labeled dataset.
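The pairing objective above can be sketched as follows. The embeddings and `config_id` labels here are hypothetical stand-ins: in the paper, embeddings come from the DNN front-end, and the "same configuration" label is derived for free from whether two segments were cut from the same recording session, which is what makes the pre-training self-supervised.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def make_pairs(segments):
    """Build (similarity, target) training pairs.

    segments: list of (embedding, config_id) tuples, where config_id is a
    stand-in label derived from the source recording, not human annotation.
    """
    pairs = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            (emb_a, cfg_a), (emb_b, cfg_b) = segments[i], segments[j]
            target = 1 if cfg_a == cfg_b else 0  # same acoustic configuration?
            pairs.append((cosine(emb_a, emb_b), target))
    return pairs
```

The binary targets would then supervise the embedding network so that segments from the same configuration score high cosine similarity, before fine-tuning on the spoofing labels.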
Datasets

VoxCeleb1 & 2 (pre-training); ASVspoof 2019 physical access dataset (replay spoofing detection)
Model(s)

Modified end-to-end (E2E) DNN architecture (CNN) and Light Convolutional Neural Network (LCNN)
Author countries

Republic of Korea