Deepfake Detection of Singing Voices With Whisper Encodings

Authors: Falguni Sharma, Priyanka Gupta

Published: 2025-01-31 06:43:50+00:00

AI Summary

This paper proposes a singing voice deepfake detection (SVDD) system using noise-variant encodings from OpenAI's Whisper model. Although Whisper is known to be noise-robust, its encodings retain non-speech information, which the system uses to differentiate between real and fake singing voices. Performance is evaluated using Equal Error Rate (EER).
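Since EER is the only metric named here, a short illustration may help. The sketch below is not the authors' evaluation code; it assumes that a higher score means "more likely bona fide", and the function name and toy scores are hypothetical. It scans every observed score as a threshold and reports the point where the false acceptance rate and false rejection rate are closest.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Return (eer, threshold), assuming a higher score means 'more bona fide'."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: a bona fide clip scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a deepfake clip scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Average the two rates in case they do not cross exactly.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical, not from the paper).
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(bonafide, spoof)
print(f"EER = {100 * eer:.1f}% at threshold {threshold}")  # EER = 25.0% at threshold 0.6
```

A lower %EER means the bona fide and deepfake score distributions overlap less.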

Abstract

The deepfake generation of singing vocals is a concerning issue for artists in the music industry. In this work, we propose a singing voice deepfake detection (SVDD) system, which uses noise-variant encodings of OpenAI's Whisper model. As counter-intuitive as it may sound, even though the Whisper model is known to be noise-robust, the encodings are rich in non-speech information, and are noise-variant. This leads us to evaluate Whisper encodings as feature representations for the SVDD task. Therefore, in this work, the SVDD task is performed on vocals and mixtures, and the performance is evaluated in %EER over varying Whisper model sizes and two classifiers, CNN and ResNet34, under different testing conditions.

Key findings

The Whisper (medium) model with ResNet34 achieved the best performance across various testing conditions on the SingFake dataset. The proposed approach significantly outperformed existing methods, especially in scenarios with unseen languages and musical contexts, although these conditions remain challenging. Results show that detecting deepfakes in vocals is easier than in mixtures.

Approach

The proposed SVDD system uses Whisper's encoder to extract noise-variant features from audio input. These features, along with CNN or ResNet34 classifiers, are used to distinguish between bonafide and deepfake singing voices in both isolated vocals and mixtures with background music.
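To give a sense of the feature size a downstream classifier would receive, the sketch below computes the encoder output shape for each Whisper size listed on this page. It assumes the standard public Whisper front end (16 kHz audio, 30-second windows, 10 ms hop, time axis halved inside the encoder) and the hidden sizes of the public checkpoints. Whether the authors use the final encoder layer, a different window length, or some pooling is not stated here, so the helper name and constants are illustrative only.

```python
# Hidden sizes of the public Whisper checkpoints named on this page.
WHISPER_HIDDEN_SIZE = {"tiny": 384, "base": 512, "small": 768, "medium": 1024}

# Assumed standard Whisper front-end settings (not confirmed by this summary).
SAMPLE_RATE = 16_000   # audio is resampled to 16 kHz
CHUNK_SECONDS = 30     # audio is padded or trimmed to 30 s windows
HOP_LENGTH = 160       # 10 ms hop for the log-Mel spectrogram
CONV_STRIDE = 2        # a strided convolution in the encoder halves the time axis


def encoder_output_shape(model_size):
    """Return (frames, hidden_size) of the encoder output for one 30 s window."""
    mel_frames = SAMPLE_RATE * CHUNK_SECONDS // HOP_LENGTH
    encoder_frames = mel_frames // CONV_STRIDE
    return encoder_frames, WHISPER_HIDDEN_SIZE[model_size]


for size in WHISPER_HIDDEN_SIZE:
    print(size, encoder_output_shape(size))
```

Under these assumptions each window yields a 2-D array of frames by hidden units, which is the kind of image-like input a CNN or ResNet34 classifier can consume.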

Datasets

SingFake dataset, including vocals (isolated singing voices) and mixtures (singing voices with background music).

Model(s)

Whisper model (tiny, base, small, medium variants), CNN, ResNet34.

Author countries

India, India
