Detecting music deepfakes is easy but actually hard

Authors: Darius Afchar, Gabriel Meseguer-Brocal, Romain Hennequin

Published: 2024-05-07 10:39:19+00:00

AI Summary

This paper introduces the first published music deepfake detector, achieving surprisingly high accuracy (99.8%) with convolutional neural networks trained on real and auto-encoded audio. However, the authors emphasize the limitations of this approach and call for further research into robustness, generalization, calibration, and interpretability.

Abstract

In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. The ability to create credible minute-long music deepfakes in a few seconds on user-friendly platforms poses a real threat of fraud on streaming services and unfair competition to human artists. This paper demonstrates the possibility (and surprising ease) of training classifiers on datasets comprising real audio and fake reconstructions, achieving a convincing accuracy of 99.8%. To our knowledge, this marks the first publication of a music deepfake detector, a tool that will help in the regulation of music forgery. Nevertheless, informed by decades of literature on forgery detection in other fields, we stress that a good test score is not the end of the story. We step back from the straightforward ML framework and expose many facets that could be problematic with such a deployed detector: calibration, robustness to audio manipulation, generalisation to unseen models, interpretability and possibility for recourse. This second part acts as a position for future research steps in the field and a caveat to a flourishing market of fake content checkers.


Key findings
Although the detector reaches high test accuracy, it lacks robustness to common audio manipulations and generalizes poorly to autoencoders unseen during training. The authors stress that calibration and interpretability must also be addressed for ethical, reliable deployment.
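This fragility can be probed by re-scoring perturbed copies of the test audio. Below is a minimal sketch assuming a PyTorch/torchaudio pipeline; the two manipulations shown (a resampling round-trip and low-level white noise) are illustrative stand-ins, not the paper's exact set:

```python
import torch
import torchaudio.functional as F

def perturbations(wav: torch.Tensor, sr: int):
    """Yield (name, perturbed waveform) pairs for a robustness check.
    These manipulations are illustrative; the paper tests a broader set."""
    yield "identity", wav
    # Down/up resampling round-trip: a mild, codec-like degradation.
    down = F.resample(wav, orig_freq=sr, new_freq=sr // 2)
    yield "resample_roundtrip", F.resample(down, orig_freq=sr // 2, new_freq=sr)
    # Low-level additive white noise, barely audible but enough to shift artifacts.
    yield "white_noise", wav + 0.005 * torch.randn_like(wav)

# Re-evaluating a trained detector on each perturbed copy and comparing
# accuracies to the clean baseline exposes the robustness gap reported here.
```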
Approach
The authors train convolutional neural networks to distinguish real audio from audio reconstructed by autoencoders. Because each fake is a reconstruction of a real track, the two classes share musical content, which controls for confounding factors like genre and bitrate and focuses the classifier on artifacts specific to the generation process.
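A minimal sketch of producing one such real/fake pair with Meta's encodec package follows; the file paths and the 6 kbps bandwidth are illustrative choices, and the paper additionally uses DAC, GriffinMel, and Musika decoders:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Pretrained 24 kHz EnCodec autoencoder.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # kbps; one of the model's supported bandwidths

# "real.wav" is a placeholder for any genuine track.
wav, sr = torchaudio.load("real.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))  # waveform -> discrete codes
    fake = model.decode(frames)[0]           # codes -> reconstruction ("fake" class)

torchaudio.save("fake.wav", fake.cpu(), model.sample_rate)
```

The reconstruction keeps the melody, genre, and loudness of the original, so any signal a classifier picks up on is, by construction, a generation artifact.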
Datasets
FMA dataset (medium split, 25,000 tracks), with auto-encoded versions generated using Encodec, DAC, GriffinMel, and Musika decoders.
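Assembling such a corpus into a training set is straightforward. The sketch below assumes a hypothetical directory layout with real/ and fake/ folders holding the original and auto-encoded files:

```python
import random
from pathlib import Path

import torch
import torchaudio
from torch.utils.data import Dataset

class RealFakeAudioDataset(Dataset):
    """Yields (waveform excerpt, label), label 0 = real, 1 = auto-encoded.
    The real/fake directory layout is an assumption for illustration."""

    def __init__(self, root: str, excerpt_seconds: float = 3.0, sample_rate: int = 24_000):
        self.items = [(p, 0) for p in sorted(Path(root, "real").glob("*.wav"))]
        self.items += [(p, 1) for p in sorted(Path(root, "fake").glob("*.wav"))]
        self.sample_rate = sample_rate
        self.n_samples = int(excerpt_seconds * sample_rate)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        path, label = self.items[idx]
        wav, sr = torchaudio.load(str(path))
        wav = torchaudio.functional.resample(wav, sr, self.sample_rate).mean(dim=0)  # mono
        if wav.numel() < self.n_samples:  # zero-pad tracks shorter than one excerpt
            wav = torch.nn.functional.pad(wav, (0, self.n_samples - wav.numel()))
        start = random.randint(0, wav.numel() - self.n_samples)
        return wav[start : start + self.n_samples], label
```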
Model(s)
Compact convolutional neural networks with six convolutional layers and roughly 1.6M parameters.
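For a sense of scale, here is a sketch of a detector in that spirit: six convolutional blocks over a spectrogram input, global average pooling, and a linear head. The channel widths, kernel sizes, and input representation are assumptions, landing near (not exactly at) the paper's 1.6M parameters:

```python
import torch
import torch.nn as nn

class SmallConvDetector(nn.Module):
    """Six-conv-layer real-vs-fake classifier over spectrogram patches.
    Widths and kernels are illustrative, not the paper's architecture."""

    def __init__(self, n_classes: int = 2):
        super().__init__()
        chans = [1, 64, 64, 128, 128, 256, 256]  # six conv layers
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),  # halves both spectrogram dimensions
            ]
        self.features = nn.Sequential(*blocks)
        self.head = nn.Linear(chans[-1], n_classes)

    def forward(self, x):  # x: (batch, 1, n_mels, n_frames), both dims >= 64
        h = self.features(x)
        h = h.mean(dim=(2, 3))  # global average pooling
        return self.head(h)

model = SmallConvDetector()
print(sum(p.numel() for p in model.parameters()))  # ~1.1M, same order as the paper's 1.6M
```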
Author countries
France