Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception

Authors: Yuankun Xie, Ruibo Fu, Zhiyong Wang, Xiaopeng Wang, Songjun Cao, Long Ma, Haonan Cheng, Long Ye

Published: 2025-04-09 10:18:45+00:00

AI Summary

This paper introduces a novel wavelet prompt tuning (WPT) method for all-type audio deepfake detection, significantly improving cross-type detection accuracy. WPT optimizes self-supervised learning (SSL) models by learning specialized prompt tokens in the frequency domain, requiring far fewer trainable parameters than fine-tuning.
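To make the prompt-tuning idea concrete, here is a minimal PyTorch sketch of the paradigm the summary describes: the SSL frontend is frozen and only a small set of learnable prompt tokens is prepended to its input sequence. Module and parameter names (`PromptTunedSSL`, `num_prompts`, the 1024-dim default matching XLSR-style encoders) are illustrative assumptions, not the authors' code, and the sketch assumes an encoder that consumes a `(batch, time, dim)` feature sequence.

```python
import torch
import torch.nn as nn

class PromptTunedSSL(nn.Module):
    """Illustrative prompt tuning: the SSL backbone stays frozen and only
    the prompt tokens (plus any downstream head) receive gradients."""

    def __init__(self, ssl_frontend: nn.Module, num_prompts: int = 10, dim: int = 1024):
        super().__init__()
        self.ssl = ssl_frontend
        for p in self.ssl.parameters():          # freeze the SSL backbone
            p.requires_grad = False
        # Learnable prompt tokens: the only trainable frontend parameters.
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, dim) feature frames entering the encoder
        b = feats.size(0)
        tokens = torch.cat([self.prompts.expand(b, -1, -1), feats], dim=1)
        return self.ssl(tokens)                  # frozen encoder sees prompts + frames
```

Because only `self.prompts` (and whatever classifier head follows) is trained, the trainable parameter count collapses relative to full fine-tuning, which is the source of the roughly 458x reduction the summary cites.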

Abstract

The rapid advancement of audio generation technologies has escalated the risks of malicious deepfake audio across speech, sound, singing voice, and music, threatening multimedia security and trust. While existing countermeasures (CMs) perform well in single-type audio deepfake detection (ADD), their performance declines in cross-type scenarios. This paper is dedicated to studying the all-type ADD task. We are the first to comprehensively establish an all-type ADD benchmark to evaluate current CMs, incorporating cross-type deepfake detection across speech, sound, singing voice, and music. Then, we introduce the prompt tuning self-supervised learning (PT-SSL) training paradigm, which optimizes the SSL frontend by learning specialized prompt tokens for ADD, requiring 458x fewer trainable parameters than fine-tuning (FT). Considering the auditory perception of different audio types, we propose the wavelet prompt tuning (WPT)-SSL method to capture type-invariant auditory deepfake information from the frequency domain without requiring additional training parameters, thereby enhancing performance over FT in the all-type ADD task. To achieve a universal CM, we utilize all types of deepfake audio for co-training. Experimental results demonstrate that WPT-XLSR-AASIST achieved the best performance, with an average EER of 3.58% across all evaluation sets. The code is available online.


Key findings
WPT-XLSR-AASIST achieved the best performance, with an average equal error rate (EER) of 3.58% across all evaluation sets. WPT significantly outperformed fine-tuning (FT) on the all-type audio deepfake detection task while requiring 458 times fewer trainable parameters. WPT also exhibited type invariance in t-SNE visualizations and attention distributions, highlighting its effectiveness across diverse audio types.
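The EER reported above is the standard detection operating point where the false-acceptance rate (spoofs accepted) equals the false-rejection rate (bonafide rejected). A minimal NumPy sketch of its computation follows; the function name and the convention that higher scores mean "bonafide" are our assumptions:

```python
import numpy as np

def compute_eer(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Equal error rate: the threshold where false-acceptance and
    false-rejection rates cross. Higher scores indicate 'bonafide'."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones_like(bonafide_scores),
                             np.zeros_like(spoof_scores)])
    order = np.argsort(scores)
    labels = labels[order]
    # Sweep the threshold upward: FRR rises while FAR falls.
    frr = np.cumsum(labels) / labels.sum()                   # bonafide rejected so far
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # spoofs still accepted
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)
```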
Approach
The authors propose a prompt tuning self-supervised learning (PT-SSL) paradigm and extend it to wavelet prompt tuning (WPT-SSL). WPT-SSL uses a discrete wavelet transform to expose frequency-domain information, improving the model's ability to detect deepfakes across different audio types (speech, sound, singing voice, and music); a sketch of this idea follows below. All types of deepfake audio are used for co-training to obtain a universal countermeasure.
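Below is a minimal sketch of one way to realize parameter-free wavelet prompts, assuming the prompts are pooled from the low-frequency sub-band of a single-level Haar DWT applied along the time axis of the frame sequence. This is our illustrative reading of the approach description, not the authors' released implementation; `haar_dwt_time`, `WaveletPrompt`, and `num_prompts` are hypothetical names.

```python
import torch
import torch.nn as nn

def haar_dwt_time(feats: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Single-level Haar DWT along the time axis of (batch, time, dim)
    features. Returns the approximation (low-pass) and detail (high-pass)
    sub-bands, each at half the temporal resolution."""
    t = feats.size(1) - feats.size(1) % 2        # truncate to an even length
    even, odd = feats[:, 0:t:2], feats[:, 1:t:2]
    approx = (even + odd) / 2 ** 0.5             # low-frequency band
    detail = (even - odd) / 2 ** 0.5             # high-frequency band
    return approx, detail

class WaveletPrompt(nn.Module):
    """Illustrative wavelet prompting: prompts are derived from the input's
    own wavelet sub-band, so no extra trainable parameters are introduced."""

    def __init__(self, num_prompts: int = 10):
        super().__init__()
        self.num_prompts = num_prompts

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        approx, _ = haar_dwt_time(feats)
        # Pool the low-frequency band into a fixed number of prompt tokens.
        pooled = nn.functional.adaptive_avg_pool1d(
            approx.transpose(1, 2), self.num_prompts).transpose(1, 2)
        return torch.cat([pooled, feats], dim=1)  # prompts + original frames
```

Since the prompts are computed from the signal itself rather than learned, this variant injects frequency-domain context while keeping the trainable parameter count unchanged, matching the paper's stated motivation for WPT.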
Datasets
ASVspoof 2019 LA (speech), Codecfake-A3 (sound), CtrSVDD (singing voice), FakeMusicCaps (music)
Model(s)
AASIST, XLSR-AASIST, MERT-AASIST, WavLM-AASIST, Spec-ResNet (with variants trained under the frozen (FR), fine-tuning (FT), prompt tuning (PT), and wavelet prompt tuning (WPT) paradigms)
Author countries
China