From Real to Cloned Singer Identification

Authors: Dorian Desblancs, Gabriel Meseguer-Brocal, Romain Hennequin, Manuel Moussallam

Published: 2024-07-11 16:25:21+00:00

AI Summary

This paper investigates whether singer identification methods can be used to detect cloned voices in music. Three embedding models trained with a singer-level contrastive learning scheme are evaluated on real and cloned voices, revealing a significant performance drop on cloned voices, particularly for models that take full mixtures as input.
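As a rough illustration of the training objective, below is a minimal sketch of a singer-level contrastive loss in PyTorch. It assumes an NT-Xent-style formulation in which the two views of each batch item are segments with vocals from the same singer; the function name, temperature value, and batch construction are illustrative assumptions, not the authors' code.

    import torch
    import torch.nn.functional as F

    def singer_contrastive_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
        """NT-Xent-style loss over a batch of singer-level positive pairs.

        z_a[i] and z_b[i] embed two segments with vocals from the same
        singer; every other pairing in the batch serves as a negative.
        Depending on the model variant, the segments are mixtures,
        vocal stems, or both.
        """
        z_a = F.normalize(z_a, dim=-1)
        z_b = F.normalize(z_b, dim=-1)
        logits = z_a @ z_b.t() / temperature  # (B, B) cosine similarities
        targets = torch.arange(z_a.size(0), device=z_a.device)
        # Symmetric cross-entropy: each segment must retrieve its pair.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))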

Abstract

Cloned voices of popular singers sound increasingly realistic and have gained popularity over the past few years. They however pose a threat to the industry due to personality rights concerns. As such, methods to identify the original singer in synthetic voices are needed. In this paper, we investigate how singer identification methods could be used for such a task. We present three embedding models that are trained using a singer-level contrastive learning scheme, where positive pairs consist of segments with vocals from the same singers. These segments can be mixtures for the first model, vocals for the second, and both for the third. We demonstrate that all three models are highly capable of identifying real singers. However, their performance deteriorates when classifying cloned versions of singers in our evaluation set. This is especially true for models that use mixtures as an input. These findings highlight the need to understand the biases that exist within singer identification systems, and how they can influence the identification of voice deepfakes in music.


Key findings
All three models identify real singers effectively, but performance degrades significantly on cloned voices, especially for the models that take mixtures as input. Models trained on vocal stems perform considerably better on cloned voices, suggesting that mixture-based models partly rely on instrumental context rather than on the voice itself.
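The gap itself is straightforward to quantify. The sketch below assumes a standard top-1 classification accuracy metric and compares the same classifier on a real-voice split and a cloned-voice split; variable names such as clf and real_embeds are hypothetical.

    import torch

    @torch.no_grad()
    def top1_accuracy(classifier, embeddings, singer_ids) -> float:
        """Fraction of segments whose predicted singer matches the label."""
        preds = classifier(embeddings).argmax(dim=-1)
        return (preds == singer_ids).float().mean().item()

    # The paper's central observation is the gap between these two numbers,
    # which is largest for the mixture-input model (hypothetical variables):
    # acc_real   = top1_accuracy(clf, real_embeds, real_singer_ids)
    # acc_cloned = top1_accuracy(clf, cloned_embeds, cloned_singer_ids)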
Approach
The authors train three embedding models with a singer-level contrastive learning scheme; the models differ only in their input: full mixtures, separated vocal stems, or both. A classifier is then trained on top of these embeddings to identify singers (a sketch of this two-stage setup follows below).
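A minimal sketch of the second stage, assuming the embedding model is kept fixed and a small fully-connected head is trained with cross-entropy. The dimensions, hidden width, and names (SingerHead, train_step) are illustrative assumptions rather than the paper's exact configuration.

    import torch
    import torch.nn as nn

    class SingerHead(nn.Module):
        """Fully-connected classifier on top of contrastive embeddings."""
        def __init__(self, embed_dim: int = 512, n_singers: int = 1000):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(embed_dim, 256),
                nn.ReLU(),
                nn.Linear(256, n_singers),
            )

        def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
            return self.net(embeddings)

    def train_step(head, optimizer, embeddings, singer_ids):
        """One classifier update on precomputed (frozen) embeddings."""
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(head(embeddings), singer_ids)
        loss.backward()
        optimizer.step()
        return loss.item()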
Datasets
Deezer dataset (closed; 176,141 songs from 7,500 singers, some cloned), Free Music Archive (FMA), MTG-Jamendo (MTG), and a YouTube dataset of cloned voices (377 tracks from 67 singers).
Model(s)
A Transformer (the small version from [29]) for embedding generation and a fully-connected classifier for singer identification. CLMR embeddings [37] are used as a baseline.
Author countries
France