ADD 2022: the First Audio Deep Synthesis Detection Challenge

Authors: Jiangyan Yi, Ruibo Fu, Jianhua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Xiaohui Zhang, Ye Bai, Cunhang Fan, Shan Liang, Shiming Wang, Shuai Zhang, Xinrui Yan, Le Xu, Zhengqi Wen, Haizhou Li, Zheng Lian, Bin Liu

Published: 2022-02-17 03:29:20+00:00

AI Summary

The ADD 2022 challenge focuses on audio deepfake detection, addressing real-world scenarios not covered by previous shared tasks. It includes three tracks: low-quality fake audio detection, partially fake audio detection, and an audio fake game, providing diverse, challenging datasets for evaluating detection methods.

Abstract

Audio deepfake detection is an emerging topic, which was included in ASVspoof 2021. However, recent shared tasks have not covered many real-life and challenging scenarios. The first Audio Deep synthesis Detection challenge (ADD) was launched to fill this gap. ADD 2022 includes three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF) and audio fake game (FG). The LF track focuses on distinguishing bona fide utterances from fully fake ones in the presence of various real-world noises. The PF track aims to distinguish partially fake audio from real audio. The FG track is a rivalry game comprising two tasks: an audio generation task and an audio fake detection task. In this paper, we describe the datasets, evaluation metrics and protocols. We also report major findings that reflect recent advances in audio deepfake detection.


Key findings
The challenge revealed that existing models struggle to generalize across diverse audio deepfake scenarios. The average EER across all submissions was high in every track, indicating the difficulty of the task. The results highlight the need for more robust, generalizable detection models and prompt further discussion on the appropriateness of the evaluation metrics.
Approach
The paper describes the ADD 2022 challenge itself rather than a specific detection approach. It outlines three tracks with distinct challenges (low-quality fakes, partially fake audio, and a generation/detection game) and evaluates submissions with the Equal Error Rate (EER) for detection tasks and the Deception Success Rate (DSR) for the generation task.
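To make the ranking metric concrete, here is a minimal sketch of how EER can be computed from detector scores. The function name and the simple threshold sweep are illustrative assumptions, not the challenge's official scoring script; EER is the operating point where the false acceptance rate (spoof accepted as bona fide) equals the false rejection rate (bona fide rejected).

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the Equal Error Rate given detector scores.

    Assumes higher scores mean "more likely bona fide". Sweeps a
    threshold over every observed score and returns the average of
    FAR and FRR at the point where they are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bona fide scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoof scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Perfectly separated scores give an EER of 0.
print(compute_eer([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.4]))  # 0.0
```

Production scoring tools typically interpolate the ROC curve rather than sweeping discrete thresholds, but the crossing-point idea is the same.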
Datasets
AISHELL-1, AISHELL-3, and AISHELL-4 datasets were used to create training, development, adaptation, and test sets for the three tracks of the ADD 2022 challenge. These sets include genuine and fake audio utterances with varying levels of noise and manipulation.
Model(s)
Gaussian Mixture Model (GMM), Light Convolutional Neural Network (LCNN), and RawNet2 were used as baseline models for audio deepfake detection. Participants were free to use other models.
Author countries
China, Singapore