Audio compression-assisted feature extraction for voice replay attack detection

Authors: Xiangyu Shi, Yuhao Luo, Li Wang, Haorui He, Hao Li, Lei Wang, Zhizheng Wu

Published: 2023-10-09 15:53:42+00:00

AI Summary

This study proposes an audio compression-assisted feature extraction approach for voice replay attack detection. The method treats the 'missed information' after audio decompression as content- and speaker-independent channel noise and uses it as the feature for detecting replayed speech. The authors report an Equal Error Rate (EER) of 22.71% on the ASVspoof 2021 Physical Access (PA) evaluation set, which to their knowledge is the lowest on that set.

Abstract

The replay attack is one of the simplest and most effective voice spoofing attacks. According to the Automatic Speaker Verification Spoofing and Countermeasures Challenge 2021 (ASVspoof 2021), detecting replay attacks is challenging because they involve a loudspeaker, a microphone, and acoustic conditions (e.g., background noise). One obstacle to detecting replay attacks is finding robust feature representations that reflect the channel noise information added to the replayed speech. This study proposes a feature extraction approach that uses audio compression for assistance. Audio compression encodes audio so that content and speaker information is preserved for transmission. The missed information after decompression is therefore expected to contain content- and speaker-independent information (e.g., channel noise added during the replay process). We conducted a comprehensive experiment with a few data augmentation techniques and three classifiers on the ASVspoof 2021 physical access (PA) set and confirmed the effectiveness of the proposed feature extraction approach. To the best of our knowledge, the proposed approach achieves the lowest EER, 22.71%, on the ASVspoof 2021 PA evaluation set.


Key findings
The proposed audio compression-assisted feature extraction achieved an EER of 22.71% on the ASVspoof 2021 PA evaluation set, which the authors report as the lowest to their knowledge. The One-Class SVM and AnoGAN classifiers performed and generalized better than the VAE on this task. Choosing an appropriate bitrate for audio compression was found to be crucial for optimal results.
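For readers unfamiliar with the metric, EER is the error rate at the decision threshold where the false acceptance rate (spoofed trials accepted) equals the false rejection rate (bona fide trials rejected). The following is a minimal sketch of that definition, not the challenge's official scoring code. The function name `compute_eer`, the convention that a higher score means "more likely bona fide", and the toy scores are all assumptions made for illustration.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold), assuming higher score = more likely bona fide.

    Illustrative sketch: scans every observed score as a candidate threshold
    and keeps the one where FAR and FRR are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # With finite data FAR and FRR rarely match exactly, so report their mean.
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores (made up for this example, not taken from the paper).
bonafide = [0.9, 0.8, 0.7, 0.35]
spoof = [0.1, 0.2, 0.3, 0.4]
eer, threshold = compute_eer(bonafide, spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

A lower EER means better separation between bona fide and replayed speech; an EER of 50% corresponds to chance-level decisions.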
Approach
The approach extracts features by first resynthesizing audio with a WORLD vocoder, then compressing and decompressing it using the Opus codec. The difference between the Mel-spectrograms of the original and the Opus-processed audio forms the feature representation, which is then fed into one-class classifiers for replay attack detection.
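The feature described above can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the authors' implementation: the WORLD resynthesis and Opus encode/decode chain is abstracted into a hypothetical callable `codec_round_trip`, and `toy_round_trip` is a crude quantizer used only so the example runs without a vocoder or codec installed. The frame size, hop, and number of mel bands are arbitrary choices, not values from the paper.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def log_mel_spectrogram(wave, sr, n_fft=512, hop=128, n_mels=40):
    """Log-mel power spectrogram, shape (n_mels, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = mel_filterbank(sr, n_fft, n_mels) @ power.T
    return np.log(mel + 1e-10)


def compression_residual(wave, sr, codec_round_trip):
    """Mel-spectrogram difference between the original and processed audio.

    `codec_round_trip` is a hypothetical callable standing in for the
    vocoder resynthesis plus Opus compress/decompress step.
    """
    processed = codec_round_trip(wave, sr)
    n = min(len(wave), len(processed))  # codecs may change the length slightly
    return log_mel_spectrogram(wave[:n], sr) - log_mel_spectrogram(processed[:n], sr)


def toy_round_trip(wave, sr):
    """Stand-in lossy step: coarse uniform quantization (step 1/32). NOT Opus."""
    levels = 2 ** 5
    return np.round(wave * levels) / levels


# Demo on a synthetic one-second tone (not real speech).
sr = 16000
t = np.arange(sr) / sr
wave = 0.5 * np.sin(2 * np.pi * 220.0 * t)
feat = compression_residual(wave, sr, toy_round_trip)
print(feat.shape)
```

In a real pipeline, `toy_round_trip` would be replaced by an actual vocoder and Opus encoder/decoder at the chosen bitrate, and the resulting residual would be passed to a one-class classifier.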
Datasets
ASVspoof 2019 PA (for training/development) and ASVspoof 2021 PA (for progress/evaluation)
Model(s)
Variational Auto-Encoder (VAE), One-Class Support Vector Machine (OCSVM), AnoGAN (a deep convolutional generative adversarial network)
Author countries
China, Singapore