The DKU-DUKEECE System for the Manipulation Region Location Task of ADD 2023

Authors: Zexin Cai, Weiqing Wang, Yikang Wang, Ming Li

Published: 2023-08-20 14:29:04+00:00

AI Summary

This paper presents a system for the Audio Deepfake Detection Challenge (ADD 2023) Track 2, focusing on locating manipulated regions in audio. The system integrates three models: a boundary detection model, an anti-spoofing detection model, and a VAE model, achieving first place with a final ADD score of 0.6713.

Abstract

This paper introduces our system for Track 2 of the second Audio Deepfake Detection Challenge (ADD 2023), which focuses on locating manipulated regions. Our approach uses multiple detection systems to identify splicing regions and determine their authenticity. Specifically, we train and integrate two frame-level systems: one for boundary detection and the other for deepfake detection. Additionally, we employ a third system, a VAE model trained exclusively on genuine data, to determine the authenticity of a given audio clip. By fusing these three systems, our top-performing solution for the ADD challenge achieves 82.23% sentence accuracy and an F1 score of 60.66%. This results in a final ADD score of 0.6713, securing first place in Track 2 of ADD 2023.


Key findings
The integrated system achieved first place in ADD 2023 Track 2, with a final ADD score of 0.6713. This score is based on a sentence accuracy of 82.23% and an F1 score of 60.66%. The best result came from fusing the three systems (boundary detection, frame-level deepfake detection, and a VAE trained only on genuine data) rather than from any single model.
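The reported final score can be checked against its two component metrics. The short sketch below assumes the final score is a weighted sum of sentence accuracy and F1 with weights 0.3 and 0.7; these weights are not stated in this summary and are taken here as an assumption about the challenge's scoring rule. The function name `add_track2_score` is hypothetical.

```python
def add_track2_score(sentence_accuracy: float, f1: float,
                     w_acc: float = 0.3, w_f1: float = 0.7) -> float:
    """Weighted sum of sentence accuracy and F1.

    The default weights (0.3, 0.7) are an assumption, not taken from
    the summary above.
    """
    return w_acc * sentence_accuracy + w_f1 * f1


# Metrics reported in the abstract, expressed as fractions.
score = add_track2_score(sentence_accuracy=0.8223, f1=0.6066)
print(round(score, 4))  # 0.6713
```

Under this assumed weighting, the two reported metrics reproduce the reported 0.6713, and the F1 term contributes the larger share of the final score.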
Approach
The approach uses three models: a frame-level model that detects splicing boundaries, a frame-level deepfake detection model that classifies audio as genuine or fake, and a VAE model, trained exclusively on genuine data, that judges the authenticity of a whole clip. Their outputs are fused with a scoring strategy to determine the location and authenticity of manipulated regions.
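The summary does not spell out the scoring strategy, so the following is only an illustrative sketch of how frame-level boundary and deepfake outputs could be turned into labelled regions; it is not the authors' method. It assumes both models emit one probability per frame, cuts the utterance wherever the boundary probability rises above a threshold, and labels each resulting segment by its mean fake probability. The function name `locate_regions`, the thresholds, and the 20 ms frame shift are all hypothetical choices, and the clip-level VAE score is left out for brevity.

```python
from typing import List, Sequence, Tuple


def locate_regions(
    boundary_prob: Sequence[float],
    fake_prob: Sequence[float],
    frame_shift: float = 0.02,   # seconds per frame (assumed)
    boundary_thr: float = 0.5,   # assumed threshold
    fake_thr: float = 0.5,       # assumed threshold
) -> List[Tuple[float, float, bool]]:
    """Return (start_sec, end_sec, is_fake) for each detected segment."""
    if len(boundary_prob) != len(fake_prob):
        raise ValueError("both inputs must have one value per frame")

    n = len(fake_prob)
    # A new segment starts at each rising edge of the boundary probability.
    cuts = [0]
    for i in range(1, n):
        if boundary_prob[i] >= boundary_thr and boundary_prob[i - 1] < boundary_thr:
            cuts.append(i)
    cuts.append(n)

    regions = []
    for start, end in zip(cuts[:-1], cuts[1:]):
        if end <= start:
            continue  # only happens for empty input
        mean_fake = sum(fake_prob[start:end]) / (end - start)
        regions.append((start * frame_shift, end * frame_shift, mean_fake >= fake_thr))
    return regions


# Toy input: eight frames with two boundary peaks around a fake middle part.
boundary = [0.1, 0.1, 0.9, 0.1, 0.1, 0.9, 0.1, 0.1]
fake = [0.1, 0.2, 0.9, 0.8, 0.9, 0.1, 0.2, 0.1]
for start_sec, end_sec, is_fake in locate_regions(boundary, fake):
    print(f"{start_sec:.2f}-{end_sec:.2f} s: {'fake' if is_fake else 'genuine'}")
```

Averaging within boundary-delimited segments is one simple way to let the boundary model smooth noisy frame-level decisions; a real system would likely also need peak picking, minimum segment lengths, and a clip-level check.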
Datasets
ADD-Train, ADD-Dev, and ADD-Test, plus an additional out-of-domain dataset, ADD-Eval, created by re-synthesizing segments from ADD-Dev with the WORLD vocoder.
Model(s)
Wav2Vec 2.0, WavLM, Variational Autoencoder (VAE), 1D-CNN, ResNet-1D, Transformer encoder, Bidirectional LSTM (BLSTM).
Author countries
USA, China