Fully Automated End-to-End Fake Audio Detection

Authors: Chenglong Wang, Jiangyan Yi, Jianhua Tao, Haiyang Sun, Xun Chen, Zhengkun Tian, Haoxin Ma, Cunhang Fan, Ruibo Fu

Published: 2022-08-20 06:46:55+00:00

AI Summary

This paper proposes a fully automated end-to-end fake audio detection method using a wav2vec pre-trained model for feature extraction and a modified DARTS (light-DARTS) for architecture search and optimization. The method achieves a state-of-the-art equal error rate (EER) of 1.08% on the ASVspoof 2019 LA dataset, outperforming existing single systems.

Abstract

Existing fake audio detection systems often rely on expert experience to design the acoustic features or manually design the hyperparameters of the network structure. However, manual adjustment of the parameters can have a relatively obvious influence on the results. It is almost impossible to manually set the best set of parameters. Therefore, this paper proposes a fully automated end-to-end fake audio detection method. We first use a wav2vec pre-trained model to obtain a high-level representation of the speech. Furthermore, for the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS. It learns deep speech representations while automatically learning and optimizing complex neural structures consisting of convolutional operations and residual blocks. The experimental results on the ASVspoof 2019 LA dataset show that our proposed system achieves an equal error rate (EER) of 1.08%, which outperforms the state-of-the-art single system.

Key findings

The proposed fully automated system achieves an EER of 1.08% on ASVspoof 2019 LA, outperforming state-of-the-art single systems. The wav2vec features and the light-DARTS architecture are both key to this improvement. The method also generalizes well to other datasets, such as ASVspoof 2021 DF and ADD 2022.
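
For readers unfamiliar with the metric, the sketch below shows one simple way an equal error rate can be estimated from detector scores. It is an illustration only, not the official ASVspoof scoring tool: the function name `equal_error_rate`, the convention that higher scores mean "more likely bona fide", and the toy scores are all assumptions made here.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Estimate the EER by sweeping a decision threshold over all observed scores.

    Assumed convention: a trial is accepted as bona fide when score >= threshold.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap = float("inf")
    eer = None
    for t in thresholds:
        # False acceptance rate: spoofed trials wrongly accepted as bona fide.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide trials wrongly rejected.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Toy scores (hypothetical, not from the paper).
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(bonafide, spoof))
```

With finite score sets the two error rates rarely cross exactly, so this sketch reports the midpoint at the closest threshold; evaluation toolkits typically interpolate instead.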

Approach

The approach uses a wav2vec pre-trained model to extract high-level speech representations. A modified DARTS, named light-DARTS, automatically learns and optimizes the neural network architecture, including convolutional operations and residual blocks, for fake audio classification. The model incorporates a Max Feature Map (MFM) module for feature selection.
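
The two building blocks named above can be sketched in a few lines of plain Python. The first part shows the continuous relaxation used by standard DARTS, where each edge outputs a softmax-weighted sum of candidate operations and the highest-weighted operation is kept after the search. The second part shows a Max Feature Map, which takes the element-wise maximum of two halves of the channels. This is a minimal sketch under assumptions: it does not reproduce the specific modifications that light-DARTS makes, and the names `mixed_op`, `derive_op`, `max_feature_map` and the toy operations are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, candidate_ops, alphas):
    """Standard DARTS relaxation: weighted sum of all candidate operations.

    `alphas` are the architecture parameters for this edge, one per operation.
    """
    weights = softmax(alphas)
    outputs = [op(x) for op in candidate_ops]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(x))
    ]


def derive_op(alphas):
    """After the search, keep the index of the operation with the largest alpha."""
    return max(range(len(alphas)), key=lambda i: alphas[i])


def max_feature_map(channels):
    """MFM: split channels into two halves and take the element-wise maximum."""
    if len(channels) % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    half = len(channels) // 2
    return [
        [max(a, b) for a, b in zip(channels[i], channels[i + half])]
        for i in range(half)
    ]


# Toy candidate operations (hypothetical stand-ins for convolutions and skips).
identity = lambda x: list(x)
double = lambda x: [2 * v for v in x]

print(mixed_op([1.0, 2.0], [identity, double], [0.0, 0.0]))
print(derive_op([0.1, 1.3]))
print(max_feature_map([[1, 5], [4, 2], [3, 0], [0, 9]]))
```

Because MFM halves the channel count while keeping the stronger activation at each position, it acts as a lightweight feature selector, which matches the role described for it above.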

Datasets

ASVspoof 2019 LA dataset, ASVspoof 2021 DF dataset, ADD 2022 challenge database (Track 1)

Model(s)

wav2vec (large), wav2vec 2.0 (base and large), light-DARTS (a modified version of DARTS), LCNN

Author countries

China
