Fully Automated End-to-End Fake Audio Detection

Authors: Chenglong Wang, Jiangyan Yi, Jianhua Tao, Haiyang Sun, Xun Chen, Zhengkun Tian, Haoxin Ma, Cunhang Fan, Ruibo Fu

Published: 2022-08-20 06:46:55+00:00

AI Summary

This paper introduces a fully automated end-to-end fake audio detection method that eliminates the need for manual feature engineering or hyperparameter tuning. It utilizes pre-trained wav2vec models for high-level speech representation combined with a novel light-DARTS architecture search for automatically optimizing the neural network structure. The proposed system achieves state-of-the-art performance on the ASVspoof 2019 LA dataset.

Abstract

The existing fake audio detection systems often rely on expert experience to design the acoustic features or manually design the hyperparameters of the network structure. However, manual adjustment of the parameters can have a relatively obvious influence on the results. It is almost impossible to manually set the best set of parameters. Therefore, this paper proposes a fully automated end-to-end fake audio detection method. We first use a pre-trained wav2vec model to obtain a high-level representation of the speech. Furthermore, for the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS. It learns deep speech representations while automatically learning and optimizing complex neural structures consisting of convolutional operations and residual blocks. The experimental results on the ASVspoof 2019 LA dataset show that our proposed system achieves an equal error rate (EER) of 1.08%, which outperforms the state-of-the-art single system.


Key findings
The proposed fully automated end-to-end system achieved an equal error rate (EER) of 1.08% on the ASVspoof 2019 LA evaluation set, outperforming existing state-of-the-art single systems. Pre-trained wav2vec features consistently outperformed traditional LFCC features, and the light-DARTS architecture significantly improved results by automatically optimizing the network structure. The system also generalized well to other datasets, namely ASVspoof 2021 DF and ADD 2022 Track 1.
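For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The following is a minimal sketch of how it can be estimated from detection scores, not the official ASVspoof scoring tool. The function name `compute_eer` and the convention that a higher score means "more likely bona fide" are assumptions made for illustration.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the equal error rate from two lists of detection scores.

    Assumption: a higher score means the utterance is more likely bona fide.
    Returns (eer, threshold), where eer is a fraction in [0, 1].
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))

    best = None  # (gap, eer_estimate, threshold)
    for t in thresholds:
        # False rejection rate: bona fide utterances scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed utterances scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)

    return best[1], best[2]


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    bonafide = [0.9, 0.8, 0.7, 0.4]
    spoof = [0.1, 0.2, 0.3, 0.6]
    eer, threshold = compute_eer(bonafide, spoof)
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

With only a finite set of scores the two error curves rarely cross exactly, so this sketch reports the average of the two rates at the closest threshold; scoring tools may interpolate instead.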
Approach
The approach consists of two modules. A feature extraction module uses pre-trained wav2vec models (wav2vec, wav2vec 2.0 base, wav2vec 2.0 large) to obtain high-level speech representations directly from raw waveforms. A light-DARTS module then performs classification; it is a modified differentiable architecture search method that incorporates a Max Feature Map (MFM) module and automatically learns and optimizes the neural network structure.
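The two building blocks named above can be illustrated with a small sketch. It shows the Max Feature Map operation as it is commonly defined (split the channels into two halves and keep the element-wise maximum) and the continuous relaxation used by standard DARTS (a softmax-weighted sum of candidate operations). It is a toy on plain Python lists, not the authors' implementation: how light-DARTS modifies DARTS is not detailed in this summary, and the candidate operations and the names `max_feature_map` and `mixed_op` are hypothetical.

```python
import math


def max_feature_map(channels):
    """MFM: split the channel list into two halves, keep the element-wise max.

    `channels` is a list of equally long feature vectors (one per channel).
    The output has half as many channels as the input.
    """
    if len(channels) % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    half = len(channels) // 2
    return [
        [max(a, b) for a, b in zip(first, second)]
        for first, second in zip(channels[:half], channels[half:])
    ]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, candidate_ops, alphas):
    """DARTS-style mixed operation on one edge of the search cell.

    Each candidate operation is applied to `x`, and the outputs are combined
    with softmax weights derived from the architecture parameters `alphas`.
    """
    weights = softmax(alphas)
    outputs = [op(x) for op in candidate_ops]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(x))
    ]


if __name__ == "__main__":
    # Four toy channels become two after MFM.
    print(max_feature_map([[1, 5], [2, 2], [3, 0], [0, 4]]))

    # Hypothetical candidate operations standing in for convolutions etc.
    ops = [
        lambda v: list(v),             # identity (skip connection)
        lambda v: [0.0 for _ in v],    # zero (no connection)
        lambda v: [2.0 * e for e in v],  # placeholder for a learned op
    ]
    print(mixed_op([3.0, 6.0], ops, alphas=[0.0, 0.0, 0.0]))
```

In a real search the `alphas` are trained by gradient descent together with the network weights, and after the search the operation with the largest weight on each edge is typically kept.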
Datasets
ASVspoof 2019 LA, ASVspoof 2021 DF, ADD 2022 Track 1
Model(s)
wav2vec, wav2vec 2.0 (base and large), light-DARTS (based on DARTS with Max Feature Map)
Author countries
China