SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

Published: 2024-05-08 17:40:12+00:00

AI Summary

This paper introduces the SVDD Challenge 2024, the first research challenge focused on singing voice deepfake detection. To advance research in this specialized area, the challenge features two tracks: one with controlled, isolated vocals and another with in-the-wild recordings containing background music.

Abstract

The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specialized field requiring focused attention. To promote SVDD research, we recently proposed the SVDD Challenge, the very first research challenge focusing on SVDD for lab-controlled and in-the-wild bonafide and deepfake singing voice recordings. The challenge will be held in conjunction with the 2024 IEEE Spoken Language Technology Workshop (SLT 2024).


Key findings
Baseline systems achieved equal error rates (EERs) of 11.37% (LFCC) and 10.39% (raw waveform) on the CtrSVDD evaluation set. Performance on the validation set was significantly better than on the evaluation set, highlighting the difficulty of generalizing to unseen deepfake generation methods. Results for the WildSVDD track are pending.
Approach
The challenge comprises two tracks: CtrSVDD (controlled, clean vocals) and WildSVDD (in-the-wild recordings with background music). Participants develop systems to distinguish bonafide from deepfake singing voices; systems are evaluated by Equal Error Rate (EER), sketched below.
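The EER is the operating point where the false positive rate (deepfakes accepted as bonafide) equals the false negative rate (bonafide rejected). As a minimal sketch, not the challenge's official scoring script, the EER can be computed from system scores with scikit-learn; the label convention and toy scores below are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Equal Error Rate: point where false positive rate == false negative rate.

    labels: 1 for bonafide, 0 for deepfake (assumed convention)
    scores: higher values mean "more likely bonafide"
    """
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    # Pick the threshold index where FPR and FNR are closest, and
    # report their midpoint as the EER.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy usage with made-up scores:
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.1])
print(f"EER: {compute_eer(labels, scores):.2%}")
```

Because the ROC curve is sampled at discrete thresholds, averaging the FPR and FNR at the closest crossing is a common approximation; interpolation-based variants give slightly different values on small score sets.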
Datasets
CtrSVDD: Opencpop, M4Singer, KiSing, the official ACE-Studio release, Ofuton-P, Oniku Kurumi, Kiritan, JVS-MuSiC; WildSVDD: an expanded SingFake dataset (approximately double the original size, including Korean singers).
Model(s)
Two baseline systems are provided: one using raw waveforms and another using Linear Frequency Cepstral Coefficients (LFCCs) as input features. Both employ a modified AASIST architecture.
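For the LFCC-based baseline, a front-end can be sketched with torchaudio's built-in LFCC transform. This is a minimal illustration, not the baseline's actual code: the configuration values (n_lfcc, FFT/hop/window sizes) and the file path "clip.flac" are assumptions.

```python
import torchaudio

# Illustrative LFCC front-end; the challenge baseline's exact
# configuration (n_lfcc, FFT/hop sizes, deltas) is an assumption here.
lfcc_extractor = torchaudio.transforms.LFCC(
    sample_rate=16000,
    n_lfcc=60,
    speckwargs={"n_fft": 512, "hop_length": 160, "win_length": 400},
)

# "clip.flac" is a placeholder path for a 16 kHz singing-voice clip.
waveform, sample_rate = torchaudio.load("clip.flac")
features = lfcc_extractor(waveform)  # shape: (channels, n_lfcc, num_frames)
print(features.shape)
```

The raw-waveform baseline skips such a hand-crafted front-end entirely and feeds the waveform to the (modified) AASIST network, which learns its own feature extraction.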
Author countries
USA, Japan, China