EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection

Authors: Tong Zhang, Yihuan Huang, Yanzhen Ren

Published: 2025-10-22 09:34:31+00:00

AI Summary

Existing speech deepfake detection systems exhibit severe performance degradation, with accuracy dropping sharply when evaluated on realistic physical replay attacks. To counter this vulnerability, the authors introduce EchoFake, a dataset comprising over 120 hours of zero-shot TTS speech and physical replay recordings collected under varied real-world acoustic settings. Evaluation shows that models trained on EchoFake achieve better generalization and robustness across multiple standard benchmarks.

Abstract

The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail when confronted with physical replay attacks, a common and low-cost form of attack used in practical settings. Our experiments show that models trained on existing datasets exhibit severe performance degradation, with average accuracy dropping to 59.6% when evaluated on replayed audio. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both cutting-edge zero-shot text-to-speech (TTS) speech and physical replay recordings collected under varied devices and real-world environmental settings. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average EERs across datasets, indicating better generalization. By introducing more practical challenges relevant to real-world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.


Key findings
Baseline models trained on conventional datasets exhibit critical vulnerability to EchoFake's open-set replay conditions, with average EERs reaching 48%; replayed bona fide speech is particularly difficult to classify. Models retrained on the diverse EchoFake dataset achieved lower weighted average EERs (as low as 16.79% for Wav2Vec2), demonstrating significantly improved generalization across multiple standard and new spoofing benchmarks. Ablation studies confirm that including physical replay data in training is crucial for robustness against replay-based attacks.
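The equal error rate (EER) used throughout these results is the operating point where the false-acceptance rate (spoof accepted as bona fide) equals the false-rejection rate (bona fide rejected). A minimal sketch of how EER is computed from detector scores, assuming higher scores indicate bona fide speech (the function name and score convention are illustrative, not from the paper):

```python
def equal_error_rate(bona_scores, spoof_scores):
    """Return the EER given detector scores, where a higher score
    means the detector believes the utterance is bona fide."""
    # Candidate thresholds: every observed score.
    thresholds = sorted(bona_scores + spoof_scores)
    best_far, best_frr = 1.0, 0.0
    for t in thresholds:
        # False acceptance: spoofed audio scoring at/above threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide audio scoring below threshold.
        frr = sum(s < t for s in bona_scores) / len(bona_scores)
        if abs(far - frr) < abs(best_far - best_frr):
            best_far, best_frr = far, frr
    # Report the midpoint at the closest crossing of the two rates.
    return (best_far + best_frr) / 2
```

For a perfectly separable detector the EER is 0; a detector no better than chance sits near 50%, which is why the 48% average EER reported above indicates near-random behavior on replayed audio.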
Approach
The main contribution is the construction of the EchoFake dataset, which integrates synthetic speech generated by 11 zero-shot TTS models with diverse physical replay attacks. Replay data acquisition utilized systematic variations in playback devices, recording devices, environments, and microphone-speaker distances (yielding 20 unique conditions). The dataset is used to train and evaluate baseline detection models (RawNet2, AASIST, Wav2Vec2) to demonstrate the challenge of real-world replay attacks and the improved generalization gained from training on EchoFake.
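The systematic variation of replay factors described above can be sketched as a cross product over factor values. The factor lists below are assumptions for illustration only; the paper reports 20 unique conditions but does not (in this summary) give the exact devices, environments, or distances:

```python
from itertools import product

# Hypothetical factor values; the actual EchoFake factor lists are
# not given in this summary, so this toy grid yields 16 conditions
# rather than the paper's reported 20.
playback_devices = ["laptop_speaker", "phone_speaker"]
recording_devices = ["phone_mic", "usb_mic"]
environments = ["quiet_office", "noisy_street"]
distances_cm = [30, 100]

replay_conditions = list(product(playback_devices, recording_devices,
                                 environments, distances_cm))
print(len(replay_conditions))  # 16 combinations in this toy grid
```

A full cross product is one way to enumerate such conditions; the authors' actual 20-condition set may instead be a curated subset of a larger grid.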
Datasets
EchoFake, CommonVoice 17.0, ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF, In-the-Wild, WaveFake
Model(s)
RawNet2, AASIST, Wav2Vec2
Author countries
China