Audio compression-assisted feature extraction for voice replay attack detection

Authors: Xiangyu Shi, Yuhao Luo, Li Wang, Haorui He, Hao Li, Lei Wang, Zhizheng Wu

Published: 2023-10-09 15:53:42+00:00

AI Summary

This research proposes a feature extraction approach for voice replay attack detection that uses audio compression. By comparing the original audio with a compressed-then-decompressed version, the method extracts features that reflect the channel noise introduced during replay attacks. The approach achieves a state-of-the-art equal error rate (EER) of 22.71% on the ASVspoof 2021 PA evaluation set.

Abstract

Replay attack is one of the simplest and most effective voice spoofing attacks. Detecting replay attacks is challenging, according to the Automatic Speaker Verification Spoofing and Countermeasures Challenge 2021 (ASVspoof 2021), because they involve a loudspeaker, a microphone, and acoustic conditions (e.g., background noise). One obstacle to detecting replay attacks is finding robust feature representations that reflect the channel noise added to the replayed speech. This study proposes a feature extraction approach that uses audio compression for assistance. Audio compression compresses audio while preserving content and speaker information for transmission. The information lost after compression and decompression is expected to be content- and speaker-independent (e.g., channel noise added during the replay process). We conducted a comprehensive experiment with several data augmentation techniques and three classifiers on the ASVspoof 2021 physical access (PA) set and confirmed the effectiveness of the proposed feature extraction approach. To the best of our knowledge, the proposed approach achieves the lowest EER, at 22.71%, on the ASVspoof 2021 PA evaluation set.


Key findings
The proposed audio compression-assisted feature extraction substantially improves replay attack detection performance. The lowest EER achieved is 22.71% on the ASVspoof 2021 PA evaluation set, surpassing previous state-of-the-art results. The OCSVM and AnoGAN classifiers generally outperformed the VAE.
Approach
The method uses the Opus codec to compress and then decompress the audio. The difference between the Mel-spectrograms of the original and the compressed-then-decompressed audio is used as the detection feature. One-class classifiers (VAE, OCSVM, AnoGAN) are trained on this feature to distinguish genuine from spoofed speech, as sketched below.
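
To make the pipeline concrete, here is a minimal sketch of the feature extraction step. It assumes ffmpeg is available for the Opus round trip and librosa for the Mel-spectrograms; the bitrate, sample rate, spectrogram settings, and function names are illustrative, not the paper's exact configuration.

```python
# Sketch of compression-assisted feature extraction (assumed tooling:
# ffmpeg for the Opus round trip, librosa for Mel-spectrograms).
import subprocess
import numpy as np
import librosa

def opus_round_trip(wav_path, tmp_opus="tmp.opus", tmp_wav="tmp_decoded.wav",
                    bitrate="16k"):
    """Compress a WAV file with Opus, then decompress it back to WAV."""
    subprocess.run(["ffmpeg", "-y", "-i", wav_path, "-c:a", "libopus",
                    "-b:a", bitrate, tmp_opus],
                   check=True, capture_output=True)
    subprocess.run(["ffmpeg", "-y", "-i", tmp_opus, tmp_wav],
                   check=True, capture_output=True)
    return tmp_wav

def compression_residual_feature(wav_path, sr=16000, n_mels=80):
    """Mel-spectrogram difference between original and Opus-processed audio."""
    decoded_path = opus_round_trip(wav_path)
    orig, _ = librosa.load(wav_path, sr=sr)
    proc, _ = librosa.load(decoded_path, sr=sr)
    n = min(len(orig), len(proc))  # the codec may pad or trim samples
    mel_orig = librosa.feature.melspectrogram(y=orig[:n], sr=sr, n_mels=n_mels)
    mel_proc = librosa.feature.melspectrogram(y=proc[:n], sr=sr, n_mels=n_mels)
    # The residual is expected to carry content- and speaker-independent
    # channel noise (e.g., replay artifacts) rather than linguistic content.
    return librosa.power_to_db(mel_orig) - librosa.power_to_db(mel_proc)
```

The residual feature is then passed to a one-class classifier trained on genuine speech only.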
Datasets
ASVspoof 2019 and ASVspoof 2021 physical access (PA) datasets. Training was performed on the ASVspoof 2019 PA training and development sets, and evaluation on the ASVspoof 2021 PA progress and evaluation sets.
Model(s)
Variational Autoencoder (VAE), One-class Support Vector Machine (OCSVM), AnoGAN (an anomaly-detection GAN built on a deep convolutional GAN)
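
For the classification stage, the following is a hedged sketch of one-class scoring with scikit-learn's OneClassSVM. It reuses compression_residual_feature from the sketch above; the time-mean pooling and the kernel/nu settings are assumptions rather than the paper's configuration.

```python
# One-class scoring on the compression residual features (assumed setup:
# mean pooling over time, scikit-learn's OneClassSVM with default-ish params).
import numpy as np
from sklearn.svm import OneClassSVM

def pool(mel_residual):
    """Collapse a (n_mels, n_frames) residual into a fixed-length vector."""
    return mel_residual.mean(axis=1)

def train_ocsvm(bona_fide_paths):
    """Fit a one-class SVM on residual features from genuine speech only."""
    X = np.stack([pool(compression_residual_feature(p))
                  for p in bona_fide_paths])
    return OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X)

def score(ocsvm, paths):
    """Higher scores mean 'more genuine-like'; sweeping a threshold on
    these scores is what yields the EER operating point."""
    X = np.stack([pool(compression_residual_feature(p)) for p in paths])
    return ocsvm.decision_function(X)
```

The one-class setup matters here: the classifier sees only bona fide speech during training, so replayed utterances are flagged as anomalies by their larger compression residuals.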
Author countries
China, Singapore