Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception

Authors: Yuankun Xie, Ruibo Fu, Zhiyong Wang, Xiaopeng Wang, Songjun Cao, Long Ma, Haonan Cheng, Long Ye

Published: 2025-04-09 10:18:45+00:00

Comment: Accepted to AAAI 2026

AI Summary

This paper addresses the all-type audio deepfake detection (ADD) task by establishing a comprehensive benchmark covering speech, sound, singing voice, and music. It introduces the prompt tuning self-supervised learning (PT-SSL) paradigm, which requires 458x fewer trainable parameters than fine-tuning, and the wavelet prompt tuning (WPT)-SSL method, which leverages wavelet transforms to capture type-invariant frequency-domain information without additional trainable parameters. The proposed WPT-XLSR-AASIST achieves the best performance in detecting all types of deepfake audio, with an average EER of 3.58% across all evaluation sets.

Abstract

The rapid advancement of audio generation technologies has escalated the risks of malicious deepfake audio across speech, sound, singing voice, and music, threatening multimedia security and trust. While existing countermeasures (CMs) perform well in single-type audio deepfake detection (ADD), their performance declines in cross-type scenarios. This paper is dedicated to studying the all-type ADD task. We are the first to comprehensively establish an all-type ADD benchmark to evaluate current CMs, incorporating cross-type deepfake detection across speech, sound, singing voice, and music. Then, we introduce the prompt tuning self-supervised learning (PT-SSL) training paradigm, which optimizes the SSL front-end by learning specialized prompt tokens for ADD, requiring 458x fewer trainable parameters than fine-tuning (FT). Considering the auditory perception of different audio types, we propose the wavelet prompt tuning (WPT)-SSL method to capture type-invariant auditory deepfake information from the frequency domain without requiring additional training parameters, thereby enhancing performance over FT in the all-type ADD task. To achieve a universal CM, we utilize all types of deepfake audio for co-training. Experimental results demonstrate that WPT-XLSR-AASIST achieves the best performance, with an average EER of 3.58% across all evaluation sets.


Key findings
The WPT-XLSR-AASIST model achieved the best performance, with an average Equal Error Rate (EER) of 3.58% across all evaluation sets in the co-trained all-type ADD task. The WPT-SSL method requires 458x fewer trainable parameters than fine-tuning while outperforming it. Interpretability analysis showed that WPT enables type-invariant deepfake detection by focusing attention on specific high-frequency wavelet tokens.
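The EER metric quoted above is the operating point at which the false acceptance rate (spoofed audio accepted as bona fide) equals the false rejection rate (bona fide audio rejected). The following is a minimal sketch of how such a value could be computed from detector scores. It assumes that a higher score means "more likely bona fide" and uses a simple threshold sweep; the function name `compute_eer` and the toy scores are illustrative and are not taken from the paper.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the Equal Error Rate by sweeping every observed score as a threshold.

    Assumption: higher score = more likely bona fide. A sample is accepted
    when its score is >= the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: bona fide samples scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed samples scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical, for illustration only).
bonafide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
print(f"EER = {compute_eer(bonafide, spoof):.2%}")  # EER = 25.00%
```

An average EER such as the one reported would then be the mean of the per-evaluation-set values. Production toolkits typically interpolate along the ROC curve rather than picking the nearest threshold, so results can differ slightly on small sets.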
Approach
The authors propose a Prompt Tuning Self-Supervised Learning (PT-SSL) paradigm to efficiently adapt SSL front-ends for ADD by learning specialized prompt tokens while freezing most of the SSL parameters. They further introduce Wavelet Prompt Tuning (WPT)-SSL, which applies Discrete Wavelet Transform (DWT) to a portion of these prompt tokens to enhance full-frequency perception and capture type-invariant deepfake information from the frequency domain.
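The idea of applying a DWT to part of the prompt tokens can be illustrated with a small sketch. This is not the authors' implementation: the summary does not specify the wavelet family, the number of decomposition levels, the axis along which the transform is applied, or how many tokens are transformed. The sketch assumes a single-level Haar wavelet applied along the feature axis of the first `num_wavelet` tokens, and the helper names (`haar_dwt_1d`, `wavelet_prompts`, `prepend_prompts`) are hypothetical.

```python
import math


def haar_dwt_1d(x):
    """Single-level Haar DWT of an even-length sequence.

    Returns (approximation, detail) coefficients, each half the input length.
    """
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]  # low-frequency band
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]  # high-frequency band
    return approx, detail


def wavelet_prompts(prompts, num_wavelet):
    """Transform the first `num_wavelet` prompt tokens; leave the rest unchanged.

    Assumption: each transformed token becomes [approx | detail], so the token
    dimension is preserved and no new trainable parameters are introduced.
    """
    out = []
    for idx, token in enumerate(prompts):
        if idx < num_wavelet:
            approx, detail = haar_dwt_1d(token)
            out.append(approx + detail)
        else:
            out.append(list(token))
    return out


def prepend_prompts(prompts, frame_features):
    """Prepend prompt tokens to the frame-level feature sequence fed to a layer."""
    return list(prompts) + list(frame_features)


# Toy example: 2 prompt tokens of dimension 4, 3 feature frames (values are arbitrary).
prompts = [[1.0, 1.0, 2.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
frames = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
mixed = wavelet_prompts(prompts, num_wavelet=1)
sequence = prepend_prompts(mixed, frames)
print(len(sequence))  # 5
```

In a real setup the prompt tokens would be learnable tensors and the SSL front-end weights would stay frozen; because the Haar transform is fixed and orthonormal, it reshapes the frequency content of the prompts without adding parameters, which is consistent with the "no additional training parameters" claim in the abstract.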
Datasets
Speech: 19LA; Sound: Codecfake-A3; Singing voice: CtrSVDD; Music: FakeMusicCaps
Model(s)
XLSR (wav2vec2-xls-r), WavLM, MERT, AASIST, ResNet18
Author countries
China