Echoes: A semantically-aligned music deepfake detection dataset

Authors: Octavian Pascu, Dan Oneata, Horia Cucu, Nicolas M. Müller

Published: 2026-03-24 19:10:42+00:00

AI Summary

This paper introduces Echoes, a new semantically-aligned dataset for music deepfake detection, comprising 3,577 tracks (110 hours of audio) and including content generated by ten diverse AI music systems. Designed to prevent shortcut learning, Echoes enforces semantic alignment by conditioning the generated audio on bona fide waveforms or song descriptors. Evaluations show that Echoes is a challenging benchmark, and that models trained on it generalize better than models trained on existing datasets.

Abstract

We introduce Echoes, a new dataset for music deepfake detection designed for training and benchmarking detectors under realistic and provider-diverse conditions. Echoes comprises 3,577 tracks (110 hours of audio) spanning multiple genres (pop, rock, electronic), and includes content generated by ten popular AI music generation systems. To prevent shortcut learning and promote robust generalization, the dataset is deliberately constructed to be challenging, enforcing semantic-level alignment between spoofed audio and bona fide references. This alignment is achieved by conditioning generated audio samples directly on bona fide waveforms or song descriptors. We evaluate Echoes in a cross-dataset setting against three existing AI-generated music datasets using state-of-the-art Wav2Vec2 XLS-R 2B representations. Results show that (i) Echoes is the hardest in-domain dataset; (ii) detectors trained on existing datasets transfer poorly to Echoes; (iii) training on Echoes yields the strongest generalization performance. These findings suggest that provider diversity and semantic alignment help learn more transferable detection cues.


Key findings
Echoes is identified as the hardest in-domain dataset for music deepfake detection, with an in-domain EER of 9.36%, higher than that of existing benchmarks. Detectors trained on other datasets (AIME, SONICS, FakeMusicCaps) transfer poorly to Echoes, reaching EERs of 28.6% to 41.7%. Conversely, training on Echoes yields the strongest generalization performance across the other datasets, suggesting that provider diversity and semantic alignment help detectors learn more transferable detection cues.
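All results are reported as Equal Error Rate (EER), the operating point where the false positive and false negative rates coincide. For reference, here is a minimal Python sketch of how EER is typically computed from detector scores (an illustrative implementation, not the authors' code):

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Equal Error Rate: the ROC operating point where the false
    positive rate (FPR) equals the false negative rate (FNR)."""
    fpr, tpr, _ = roc_curve(labels, scores)  # labels: 1 = spoof, 0 = bona fide
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))  # threshold where FPR ~= FNR
    return (fpr[idx] + fnr[idx]) / 2       # midpoint approximates the EER

# Example: random scores yield an EER near 50% (chance level).
rng = np.random.default_rng(0)
print(f"EER: {compute_eer(rng.integers(0, 2, 1000), rng.random(1000)):.1%}")
```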
Approach
The authors created the Echoes dataset by taking bona fide music from the Free Music Archive, generating stylistic descriptions with an LLM (ChatGPT-5.0 Thinking), and then prompting ten popular AI music generators with these descriptions and, where supported, the reference audio. For deepfake detection, they extracted embeddings with a frozen self-supervised front-end (Wav2Vec2 XLS-R 2B) and trained a logistic regression classifier on top, evaluating performance with Equal Error Rate (EER) in both in-domain and cross-dataset settings.
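A minimal sketch of this detection pipeline, assuming the HuggingFace facebook/wav2vec2-xls-r-2b checkpoint and mean pooling of the encoder's last hidden states (the pooling choice is an assumption; the paper specifies only a frozen XLS-R 2B front-end with a logistic regression classifier on top):

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression

# Frozen SSL front-end (assumed checkpoint name for Wav2Vec2 XLS-R 2B).
MODEL = "facebook/wav2vec2-xls-r-2b"
extractor = AutoFeatureExtractor.from_pretrained(MODEL)
encoder = Wav2Vec2Model.from_pretrained(MODEL).eval()

@torch.no_grad()
def embed(waveform: np.ndarray, sr: int = 16_000) -> np.ndarray:
    """Mean-pool the frozen encoder's frame-level states into one track embedding."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state  # shape: (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

def train_detector(waveforms, labels):
    """waveforms: 16 kHz mono arrays; labels: 1 = AI-generated, 0 = bona fide."""
    X = np.stack([embed(w) for w in waveforms])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    return clf  # clf.decision_function(X_test) gives scores for EER evaluation
```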
Datasets
Echoes (new dataset), AIME, SONICS, FakeMusicCaps, Free Music Archive (for bona fide tracks)
Model(s)
Wav2Vec2 XLS-R 2B (frozen encoder for embeddings), Logistic Regression (classifier)
Author countries
Romania, Germany