FSD: An Initial Chinese Dataset for Fake Song Detection

Authors: Yuankun Xie, Jingjing Zhou, Xiaolin Lu, Zhenghao Jiang, Yuxin Yang, Haonan Cheng, Long Ye

Published: 2023-09-05 13:37:30+00:00

AI Summary

This paper introduces the FSD dataset, a novel Chinese Fake Song Detection dataset created using five state-of-the-art singing voice synthesis and conversion methods. Experiments show that models trained on FSD significantly outperform speech-trained models in detecting deepfake songs, achieving a 38.58% reduction in average equal error rate.

Abstract

Singing voice synthesis and singing voice conversion have significantly advanced, revolutionizing musical experiences. However, the rise of Deepfake Songs generated by these technologies raises concerns about authenticity. Unlike Audio DeepFake Detection (ADD), the field of song deepfake detection lacks specialized datasets or methods for song authenticity verification. In this paper, we initially construct a Chinese Fake Song Detection (FSD) dataset to investigate the field of song deepfake detection. The fake songs in the FSD dataset are generated by five state-of-the-art singing voice synthesis and singing voice conversion methods. Our initial experiments on FSD revealed the ineffectiveness of existing speech-trained ADD models for the task of song deepfake detection. Thus, we employ the FSD dataset for the training of ADD models. We subsequently evaluate these models under two scenarios: one with the original songs and another with separated vocal tracks. Experiment results show that song-trained ADD models exhibit a 38.58% reduction in average equal error rate compared to speech-trained ADD models on the FSD test set.


Key findings
Speech-trained ADD models performed poorly on the FSD test set. Models trained on the FSD dataset, particularly on separated vocal tracks, detected deepfake songs far more reliably. The Wav2Vec2-LCNN model achieved the lowest equal error rate (EER) of 9.52%, a 38.58% reduction in average EER compared to speech-trained models. A minimal sketch of how EER is typically computed from detection scores follows below.
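The EER reported here is the operating point at which the false-acceptance and false-rejection rates are equal. The summary does not describe the scoring pipeline, so the following is only an illustrative sketch of the standard computation from per-utterance detection scores, using scikit-learn's ROC utilities; the labels and scores are made up.

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Equal error rate from binary labels (1 = bonafide, 0 = fake)
    and detection scores (higher = more likely bonafide)."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    # EER lies where the false-positive and false-negative rates cross;
    # take the threshold index that minimizes their gap.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy usage (illustrative values only).
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.35, 0.2, 0.1])
print(f"EER = {compute_eer(labels, scores):.2%}")
```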
Approach
The authors created a new dataset (FSD) of real songs and deepfake songs generated with five different synthesis and conversion methods. They then trained and evaluated existing audio deepfake detection (ADD) models on this dataset, comparing performance on full songs against separated vocal tracks (see the separation sketch below). The best-performing model showed a substantial reduction in error rate compared to models trained only on speech data.
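For the separated-vocals condition, a vocal stem must first be extracted from each full mix. The summary does not say which separator was used; the sketch below assumes the open-source Demucs model purely to illustrate this preprocessing step, and the file path is hypothetical.

```python
import torchaudio
from demucs.pretrained import get_model
from demucs.apply import apply_model

def separate_vocals(path, device="cpu"):
    """Return the separated vocal stem of a song as a (channels, time) tensor."""
    model = get_model("htdemucs").to(device)
    wav, sr = torchaudio.load(path)
    # Demucs expects its own sample rate and a stereo (2-channel) mix.
    wav = torchaudio.functional.resample(wav, sr, model.samplerate)
    if wav.shape[0] == 1:
        wav = wav.repeat(2, 1)
    # apply_model returns a tensor of shape (batch, sources, channels, time).
    sources = apply_model(model, wav[None].to(device), device=device)[0]
    return sources[model.sources.index("vocals")].cpu()

# Hypothetical usage:
# vocals = separate_vocals("song.wav")
# torchaudio.save("vocals.wav", vocals, 44100)
```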
Datasets
FSD (Fake Song Detection) dataset, ASVspoof2019 LA, M4Singer, Opencpop
Model(s)
AASIST, LCNN (with Mel-spectrogram and Wav2Vec2 features)
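The "Wav2Vec2" entry refers to an LCNN classifier fed with self-supervised Wav2Vec2 features rather than Mel-spectrograms. The checkpoints and layer configuration are not given in this summary, so the following is a minimal sketch under stated assumptions: a frozen Hugging Face `facebook/wav2vec2-base` encoder and a single LCNN-style Max-Feature-Map (MFM) block, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class MFMConv2d(nn.Module):
    """LCNN building block: convolution followed by Max-Feature-Map activation,
    which halves the channel dimension via an element-wise max."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel_size, padding=padding)

    def forward(self, x):
        a, b = self.conv(x).chunk(2, dim=1)
        return torch.max(a, b)

class Wav2Vec2LCNN(nn.Module):
    """Illustrative pairing: frozen Wav2Vec2 front-end -> small LCNN -> 2-way classifier."""
    def __init__(self):
        super().__init__()
        self.frontend = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.frontend.requires_grad_(False)   # assumption: keep the SSL encoder frozen
        self.lcnn = nn.Sequential(
            MFMConv2d(1, 32), nn.MaxPool2d(2),
            MFMConv2d(32, 64), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, 2)     # bonafide vs. fake

    def forward(self, waveform):               # waveform: (batch, samples) at 16 kHz
        feats = self.frontend(waveform).last_hidden_state   # (batch, frames, 768)
        x = feats.unsqueeze(1)                 # treat features as a 1-channel "image"
        x = self.lcnn(x).flatten(1)
        return self.classifier(x)

# logits = Wav2Vec2LCNN()(torch.randn(1, 16000))   # one second of dummy audio
```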
Author countries
China