Frustratingly Easy Zero-Day Audio DeepFake Detection via Retrieval Augmentation and Profile Matching

Authors: Xuechen Liu, Xin Wang, Junichi Yamagishi

Published: 2025-09-26 00:55:45+00:00

AI Summary

This study addresses the vulnerability of modern audio deepfake detection (ADD) systems to zero-day attacks, in which audio is generated by novel synthesis methods. It proposes a training-free framework that combines knowledge representations, retrieval augmentation (RA), and voice profile matching. The framework achieves performance comparable to fine-tuned models on the DeepFake-Eval-2024 benchmark without any additional model training.

Abstract

Modern audio deepfake detectors using foundation models and large training datasets have achieved promising detection performance. However, they struggle with zero-day attacks, where the audio samples are generated by novel synthesis methods that the models have not seen in their existing training data. Conventional approaches against such attacks require fine-tuning the detectors, which can be problematic when a prompt response is required. This study introduces a training-free framework for zero-day audio deepfake detection based on knowledge representations, retrieval augmentation, and voice profile matching. Based on the framework, we propose simple yet effective knowledge retrieval and ensemble methods that achieve performance comparable to fine-tuned models on DeepFake-Eval-2024, without any additional model-wise training. We also conduct ablation studies on retrieval pool size and voice profile attributes, validating their relevance to the system efficacy.

Key findings

Retrieval augmentation methods using majority voting or a ratio-based ensemble consistently achieve performance comparable to or better than fine-tuned baseline models on zero-day attacks. The hybrid retrieval strategy, which incorporates voice profile information, effectively complements the countermeasure (CM) features. In the ablation on profile attributes, the voice quality attributes within the profile feature vector are shown to be highly important for system efficacy.

Approach

The framework maintains a knowledge database storing feature representations (CM features and profile features) and labels from 'seen' data. Zero-day queries retrieve the k nearest neighbors based on feature similarity (CM-only, Profile-only, or Hybrid retrieval). A final prediction is determined using simple ensemble methods such as Majority Voting or Ratio-based Scoring over the retrieved neighbor labels/scores.
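
The retrieval-and-ensemble flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: the entry layout (`cm`, `profile`, `label`), cosine similarity as the similarity measure, concatenation as the hybrid feature, and the fraction of "fake" neighbors as the ratio-based score are all assumptions made for illustration, and the toy vectors are made up.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def features(entry, mode):
    """Select which stored representation is used for retrieval."""
    if mode == "cm":
        return entry["cm"]
    if mode == "profile":
        return entry["profile"]
    if mode == "hybrid":
        # Assumption: hybrid retrieval compares the concatenated vectors.
        return entry["cm"] + entry["profile"]
    raise ValueError(f"unknown retrieval mode: {mode}")


def retrieve(query, database, k, mode="hybrid"):
    """Return the k database entries most similar to the query."""
    q = features(query, mode)
    ranked = sorted(
        database,
        key=lambda e: cosine(q, features(e, mode)),
        reverse=True,
    )
    return ranked[:k]


def majority_vote(neighbors):
    """Predict the most frequent label among the retrieved neighbors."""
    return Counter(e["label"] for e in neighbors).most_common(1)[0][0]


def fake_ratio(neighbors):
    """Assumed ratio-based score: fraction of neighbors labeled 'fake'."""
    return sum(e["label"] == "fake" for e in neighbors) / len(neighbors)


# Toy knowledge database of 'seen' samples (hypothetical values).
database = [
    {"cm": [0.9, 0.1], "profile": [0.8, 0.2], "label": "fake"},
    {"cm": [0.8, 0.2], "profile": [0.7, 0.1], "label": "fake"},
    {"cm": [0.1, 0.9], "profile": [0.2, 0.9], "label": "real"},
    {"cm": [0.2, 0.8], "profile": [0.1, 0.7], "label": "real"},
    {"cm": [0.7, 0.3], "profile": [0.3, 0.8], "label": "real"},
]

# A zero-day query: features only, no label.
query = {"cm": [0.85, 0.15], "profile": [0.75, 0.2]}

neighbors = retrieve(query, database, k=3, mode="hybrid")
prediction = majority_vote(neighbors)
score = fake_ratio(neighbors)
```

In this sketch an odd `k` avoids ties in the binary majority vote, and `fake_ratio` yields a continuous score that could be thresholded instead of voted on. Because nothing is trained, covering a newly observed attack only requires appending labeled entries to `database`.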

Datasets

DeepFake-Eval-2024 (DE2024)

Model(s)

A Wav2Vec2-based self-supervised learning countermeasure (SSL CM) for CM feature extraction, and Vox-Profile for profile feature extraction.

Author countries

Japan
