ConvNeXt Based Neural Network for Audio Anti-Spoofing

Authors: Qiaowei Ma, Jinghui Zhong, Yitao Yang, Weiheng Liu, Ying Gao, Wing W. Y. Ng

Published: 2022-09-14 05:53:37+00:00

AI Summary

This paper proposes a lightweight end-to-end audio anti-spoofing model based on a revised ConvNeXt architecture. By incorporating a channel attention block and focal loss, the model focuses on informative speech sub-bands and on difficult-to-classify samples, achieving state-of-the-art performance on the ASVspoof 2019 LA dataset.

Abstract

With the rapid development of speech conversion and speech synthesis algorithms, automatic speaker verification (ASV) systems are vulnerable to spoofing attacks. In recent years, researchers have proposed a number of anti-spoofing methods based on hand-crafted features. However, using hand-crafted features rather than the raw waveform loses implicit information that is useful for anti-spoofing. Inspired by the promising performance of ConvNeXt in image classification tasks, we revise the ConvNeXt network architecture and propose a lightweight end-to-end anti-spoofing model. By integrating a channel attention block and using the focal loss function, the proposed model can focus on the most informative sub-bands of speech representations and on the samples that are hard to classify. Experiments show that the proposed system achieves an equal error rate of 0.64% and a min-tDCF of 0.0187 on the ASVspoof 2019 LA evaluation dataset, outperforming state-of-the-art systems.

Key findings

The proposed model achieved an equal error rate (EER) of 0.64% and a minimum tandem detection cost function (min-tDCF) of 0.0187 on the ASVspoof 2019 LA evaluation dataset, outperforming state-of-the-art systems. The channel attention module accounted for a large share of this result, reducing EER from 2.39% to 0.64%.
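The EER is the operating point at which the false acceptance rate (spoofed audio accepted) equals the false rejection rate (bona fide audio rejected). The sketch below shows one simple way to estimate it from detector scores. It assumes that higher scores mean "more likely bona fide"; the function name compute_eer and the toy scores are hypothetical and do not come from the paper or from the official ASVspoof scoring tools.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the equal error rate by sweeping every observed score as a threshold.

    Assumption: a trial is accepted as bona fide when its score >= threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap = float("inf")
    eer = None
    for t in thresholds:
        # False acceptance rate: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Toy scores for illustration only.
toy_bonafide = [0.9, 0.8, 0.3, 0.7]
toy_spoof = [0.1, 0.2, 0.6, 0.4]
print(compute_eer(toy_bonafide, toy_spoof))  # 0.25
```

On real evaluation sets the two error curves are usually interpolated rather than matched at the nearest observed threshold, so this version should be read as an approximation.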

Approach

The authors revise the ConvNeXt architecture for audio anti-spoofing, integrating a channel attention block and using focal loss. This end-to-end model processes raw waveforms directly, avoiding information loss from hand-crafted feature extraction.
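Focal loss scales the usual cross-entropy term by a factor that shrinks for well-classified samples, which is how training is steered toward the hard ones. A minimal single-sample sketch of the standard binary formulation follows. The gamma and alpha defaults are illustrative values, not the settings used in the paper, and the function name is hypothetical.

```python
import math


def binary_focal_loss(p_positive, label, gamma=2.0, alpha=0.25):
    """Standard binary focal loss for one sample.

    p_positive: predicted probability of the positive class (label 1).
    label: ground-truth class, 1 or 0.
    gamma, alpha: illustrative defaults (assumption, not from the paper).
    """
    # Probability and class weight assigned to the true class.
    p_t = p_positive if label == 1 else 1.0 - p_positive
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # Clamp to avoid log(0).
    p_t = min(max(p_t, 1e-12), 1.0)
    # (1 - p_t) ** gamma down-weights samples that are already classified well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.95, 1)  # confident and correct
hard = binary_focal_loss(0.55, 1)  # barely correct
print(hard / easy)  # the hard sample contributes far more to the loss
```

With gamma set to 0 the modulating factor disappears and the expression reduces to class-weighted cross-entropy, which makes the role of gamma easy to check in isolation.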

Datasets

ASVspoof 2019 LA (logical access) dataset

Model(s)

Revised ConvNeXt architecture with channel attention block (modified ECA module) and focal loss.
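For orientation, the standard ECA (efficient channel attention) mechanism can be sketched as follows: average each channel, run a small 1D convolution across the channel axis, apply a sigmoid, and rescale the channels. This shows the unmodified ECA idea only; the paper's modified module is not described here, and the function name, data layout, and kernel weights are assumptions made for illustration.

```python
import math


def eca_attention(features, kernel):
    """Apply standard ECA-style channel attention.

    features: list of C channels, each a flat list of activations.
    kernel: odd-length list of 1D convolution weights (learned in practice;
            passed in explicitly here as an assumption for the sketch).
    Returns the rescaled features and the per-channel attention weights.
    """
    pad = len(kernel) // 2
    # Squeeze: global average pooling gives one descriptor per channel.
    pooled = [sum(channel) / len(channel) for channel in features]
    # Zero-pad so that every channel sees a full neighbourhood.
    padded = [0.0] * pad + pooled + [0.0] * pad
    weights = []
    for c in range(len(features)):
        # Local cross-channel interaction: 1D convolution over neighbours.
        z = sum(k * padded[c + i] for i, k in enumerate(kernel))
        weights.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid gate
    # Excite: scale every activation in a channel by that channel's weight.
    scaled = [[w * x for x in channel] for channel, w in zip(features, weights)]
    return scaled, weights


toy_features = [[1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
scaled, weights = eca_attention(toy_features, kernel=[0.0, 1.0, 0.0])
print(weights)
```

Because the convolution only spans a few neighbouring channels, the attention adds very few parameters, which fits the lightweight design goal stated above.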

Author countries

China
