Are Music Foundation Models Better at Singing Voice Deepfake Detection? Far-Better Fuse them with Speech Foundation Models

Authors: Orchid Chetia Phukan, Sarthak Jain, Swarup Ranjan Behera, Arun Balaji Buduru, Rajesh Sharma, S. R Mahadeva Prasanna

Published: 2024-09-21 12:50:53+00:00

AI Summary

This research compares music foundation models (MFMs) and speech foundation models (SFMs) for singing voice deepfake detection (SVDD). It finds that speaker recognition SFMs perform best, and proposes a novel fusion framework, FIONA, which combines SFMs and MFMs to achieve state-of-the-art results with a 13.74% equal error rate (EER).

Abstract

In this study, for the first time, we extensively investigate whether music foundation models (MFMs) or speech foundation models (SFMs) work better for singing voice deepfake detection (SVDD), which has recently attracted attention in the research community. For this, we perform a comprehensive comparative study of state-of-the-art (SOTA) MFMs (MERT variants and music2vec) and SFMs (pre-trained for general speech representation learning as well as speaker recognition). We show that speaker recognition SFM representations perform the best amongst all the foundation models (FMs), and this performance can be attributed to their higher efficacy in capturing the pitch, tone, intensity, etc., characteristics present in singing voices. To this end, we also explore the fusion of FMs to exploit their complementary behavior for improved SVDD, and we propose a novel framework, FIONA, for the same. With FIONA, through the synchronization of x-vector (speaker recognition SFM) and MERT-v1-330M (MFM), we report the best performance with the lowest Equal Error Rate (EER) of 13.74%, beating all the individual FMs as well as baseline FM fusions and achieving SOTA results.


Key findings
Speaker recognition SFMs significantly outperform MFMs for SVDD. The proposed FIONA framework, fusing x-vector and MERT-v1-330M, achieves state-of-the-art performance with an EER of 13.74%, outperforming individual models and baseline fusion methods. Fusions involving SFMs generally show more complementary behavior than fusions of MFMs.
Approach
The paper compares various state-of-the-art MFMs and SFMs for SVDD. It then proposes FIONA, a novel framework that fuses features from a speaker recognition SFM (x-vector) and an MFM (MERT-v1-330M) using centered kernel alignment (CKA) as a loss function to improve SVDD performance.
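The fusion objective rests on centered kernel alignment (CKA), a similarity measure between two sets of representations. A minimal sketch of linear CKA is shown below; the paper may use a different kernel or a specific loss formulation around it, and the embedding dimensions here (512 for x-vector, 1024 for MERT-v1-330M) are illustrative assumptions.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between feature matrices X (n x d1) and Y (n x d2),
    computed over the same n samples. Returns a value in [0, 1],
    where 1 means perfectly aligned representations."""
    # Center each feature dimension across the batch
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, "fro") ** 2
    denominator = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(numerator / denominator)

# Illustrative batch: hypothetical x-vector and MERT embedding sizes
rng = np.random.default_rng(0)
sfm_feats = rng.standard_normal((64, 512))   # e.g. x-vector embeddings
mfm_feats = rng.standard_normal((64, 1024))  # e.g. MERT-v1-330M embeddings
alignment = linear_cka(sfm_feats, mfm_feats)
```

Used as a loss term (e.g., minimizing `1 - linear_cka(...)` alongside the classification loss), CKA would encourage the two FM representations to synchronize during training, which matches the framework's stated goal of exploiting complementary behavior.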
Datasets
CtrSVDD dataset
Model(s)
MERT variants (MERT-v1-330M, MERT-v1-95M, MERT-v0-public, MERT-v0), music2vec-v1, Unispeech-SAT, WavLM, Wav2vec2, x-vector. The FIONA framework fuses x-vector and MERT-v1-330M.
Author countries
India, Estonia