Waveform Boundary Detection for Partially Spoofed Audio

Authors: Zexin Cai, Weiqing Wang, Ming Li

Published: 2022-11-01 02:31:54+00:00

AI Summary

This paper proposes a deep learning-based system for detecting partially spoofed audio by identifying the waveform boundaries between genuine and manipulated segments. The system reaches an equal error rate (EER) of 6.58% on the ADD2022 challenge test set, which the authors report as the best result among partially spoofed audio detection systems that can also locate the manipulated clips.

Abstract

The present paper proposes a waveform boundary detection system for audio spoofing attacks containing partially manipulated segments. Partially spoofed/fake audio, where part of the utterance is replaced, either with synthetic or natural audio clips, has recently been reported as one scenario of audio deepfakes. As deepfakes can be a threat to social security, the detection of such spoofing audio is essential. Accordingly, we propose to address the problem with a deep learning-based frame-level detection system that can detect partially spoofed audio and locate the manipulated pieces. Our proposed method is trained and evaluated on data provided by the ADD2022 Challenge. We evaluate our detection model concerning various acoustic features and network configurations. As a result, our detection system achieves an equal error rate (EER) of 6.58% on the ADD2022 challenge test set, which is the best performance in partially spoofed audio detection systems that can locate manipulated clips.

Key findings

The proposed system achieves an equal error rate (EER) of 6.58% on the ADD2022 test set, the best reported result among partially spoofed audio detection systems that can locate manipulated clips. Wav2Vec 2.0 features and the ResNet-1D module are reported as crucial for robust performance, particularly on out-of-domain data.
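
For readers unfamiliar with the metric, the EER is the operating point at which the false-acceptance rate and the false-rejection rate are equal. The following is a minimal sketch of that computation, not the challenge's official scoring script. The function name `equal_error_rate`, the score convention (higher means more likely spoofed), and the toy inputs are all assumptions made for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping a decision threshold over the scores.

    scores: one float per utterance, higher = more likely spoofed (assumed).
    labels: 1 for spoofed, 0 for genuine.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes are needed to compute an EER")

    # Visit utterances from the highest score to the lowest.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])

    # Starting point: nothing is flagged, so FAR = 0 and FRR = 1.
    best_gap, eer = 1.0, 0.5
    tp = fp = 0
    i, n = 0, len(pairs)
    while i < n:
        # Lower the threshold past every utterance that shares this score.
        current = pairs[i][0]
        while i < n and pairs[i][0] == current:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        far = fp / neg          # genuine utterances flagged as spoofed
        frr = 1.0 - tp / pos    # spoofed utterances that were missed
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Toy data, invented for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

With a finite set of scores the two error rates may never coincide exactly, so this sketch returns their average at the threshold where they are closest.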

Approach

The authors employ a frame-level detection approach using Wav2Vec for feature extraction, a ResNet-1D for frame-level embedding, and a Transformer encoder-based classifier to predict the probability of each frame being a boundary between genuine and manipulated audio. The system identifies manipulated segments by detecting discontinuities in the waveform.
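
To make the frame-level output concrete, the sketch below shows one way per-frame boundary probabilities could be turned into boundary timestamps and a single utterance-level score. It is an illustration under stated assumptions, not the authors' post-processing: the function name `locate_boundaries`, the 0.5 threshold, the 20 ms frame shift, and the use of the maximum frame probability as the utterance score are all hypothetical choices that this summary does not confirm.

```python
def locate_boundaries(frame_probs, threshold=0.5, frame_shift_s=0.02):
    """Turn per-frame boundary probabilities into boundary times.

    frame_probs: probability that each frame is a genuine/manipulated boundary.
    threshold: assumed decision threshold for calling a frame a boundary.
    frame_shift_s: assumed time step between frames, in seconds.
    Returns (utterance_score, boundary_times).
    """
    # Assumption: an utterance is as suspicious as its most boundary-like frame.
    utterance_score = max(frame_probs) if frame_probs else 0.0

    boundary_times = []
    i, n = 0, len(frame_probs)
    while i < n:
        if frame_probs[i] >= threshold:
            # Merge a run of consecutive above-threshold frames into one boundary.
            j = i
            while j + 1 < n and frame_probs[j + 1] >= threshold:
                j += 1
            # Report the most confident frame in the run as the boundary position.
            peak = max(range(i, j + 1), key=lambda k: frame_probs[k])
            boundary_times.append(peak * frame_shift_s)
            i = j + 1
        else:
            i += 1
    return utterance_score, boundary_times


# Invented probabilities for a nine-frame clip with two boundary regions.
example_probs = [0.1, 0.2, 0.7, 0.9, 0.6, 0.1, 0.1, 0.8, 0.2]
example_score, example_times = locate_boundaries(example_probs)
```

Under these assumptions, an utterance with no frame above the threshold yields an empty boundary list and would be treated as fully genuine, while any pair of adjacent boundaries delimits a candidate manipulated segment.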

Datasets

ADD2022 Challenge datasets (ADD-train, ADD-dev, ADD-test, ADD-adaptation), Partially-fake dataset (generated from ADD-train), AISHELL-2, MUSAN, RIRs

Model(s)

Wav2Vec 2.0, ResNet-1D, Transformer encoder, BiLSTM

Author countries

USA, China
