Prompt Tuning for Audio Deepfake Detection: Computationally Efficient Test-time Domain Adaptation with Limited Target Dataset

Authors: Hideyuki Oiso, Yuto Matsunaga, Kazuya Kakizaki, Taiki Miyagawa

Published: 2024-10-13 15:07:35+00:00

Comment: Accepted at Interspeech 2024. Hideyuki Oiso and Yuto Matsunaga contributed equally

AI Summary

This paper proposes a prompt tuning method for Audio Deepfake Detection (ADD) to address critical challenges in test-time domain adaptation, including source-target domain gaps, limited target dataset sizes, and high computational costs. The method operates in a plug-in style, seamlessly integrating with state-of-the-art transformer models to enhance accuracy on target data. By introducing a small number of trainable parameters, it prevents overfitting on small datasets and maintains computational efficiency.

Abstract

We study test-time domain adaptation for audio deepfake detection (ADD), addressing three challenges: (i) source-target domain gaps, (ii) limited target dataset size, and (iii) high computational costs. We propose an ADD method using prompt tuning in a plug-in style. It bridges domain gaps by integrating seamlessly with state-of-the-art transformer models and/or with other fine-tuning methods, boosting their performance on target data (challenge (i)). In addition, our method can fit small target datasets because it does not require a large number of extra parameters (challenge (ii)). This feature also contributes to computational efficiency, countering the high computational costs typically associated with large-scale pre-trained models in ADD (challenge (iii)). We conclude that prompt tuning for ADD under domain gaps presents a promising avenue for enhancing accuracy with minimal target data and negligible extra computational burden.


Key findings
Prompt tuning consistently improved or maintained Equal Error Rates (EERs) across various target domains for two SOTA ADD models, even with extremely limited target data (as few as 10 samples), where full fine-tuning tends to overfit. The method demonstrated high computational efficiency, requiring minimal additional trainable parameters (e.g., 0.00161% for W2V) and showing rapid performance saturation with short prompt lengths (~10).
Approach
The authors propose a plug-in style prompt tuning method for test-time domain adaptation in Audio Deepfake Detection. Trainable prompt parameters are inserted into the intermediate feature vectors of pre-trained transformer-based models and fine-tuned on a small labeled target dataset. This approach allows for efficient adaptation by minimizing additional trainable parameters, thus avoiding overfitting and reducing computational burden.
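The idea above can be sketched in a few lines: trainable prompt vectors are prepended to the (frozen) backbone's feature sequence, so only `prompt_length * hidden_size` parameters are updated. All dimensions and the backbone parameter count below are illustrative assumptions, not figures from the paper (which reports 0.00161% trainable parameters for W2V), and `prepend_prompts` is a hypothetical helper, not the authors' code.

```python
# Minimal sketch of plug-in prompt tuning for a frozen transformer backbone.
# Dimensions and parameter counts are illustrative assumptions only.

def prepend_prompts(prompts, features):
    """Prepend trainable prompt vectors to a sequence of feature vectors.

    prompts  -- list of L prompt vectors (hidden size d); the only trainable part
    features -- list of T intermediate feature vectors from the frozen model
    Returns the extended sequence of length L + T fed to the frozen layers.
    """
    return prompts + features

# Back-of-envelope parameter efficiency: with prompt length L and hidden
# size d, only L * d parameters are tuned on the target domain.
L, d = 10, 1024                 # short prompt (~10, as in the paper) and an assumed width
backbone_params = 300_000_000   # assumed order of magnitude for a large ADD backbone
trainable = L * d
print(trainable)                          # 10240
print(100 * trainable / backbone_params)  # a tiny fraction of a percent
```

Because the backbone stays frozen, this adaptation step fits very small target sets (the paper reports gains with as few as 10 samples) without the overfitting risk of full fine-tuning.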
Datasets
ASVspoof 2019 LA, In-The-Wild, Hamburg Adult Bilingual LAnguage (HABLA), ASVspoof 2021 LA, Voice Conversion Challenge (VCC) 2020
Model(s)
wav2vec 2.0, AASIST, Whisper, MesoNet
Author countries
Japan