ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection

Authors: Hao Gu, Jiangyan Yi, Chenglong Wang, Jianhua Tao, Zheng Lian, Jiayi He, Yong Ren, Yujie Chen, Zhengqi Wen

Published: 2025-05-16 10:10:03+00:00

AI Summary

This paper proposes ALLM4ADD, a framework that leverages Audio Large Language Models (ALLMs) for audio deepfake detection by reformulating the task as an audio question answering problem. Supervised fine-tuning enhances the ALLM's ability to classify audio as real or fake, achieving superior performance, especially in data-scarce scenarios.

Abstract

Audio deepfake detection (ADD) has grown increasingly important due to the rise of high-fidelity audio generative models and their potential for misuse. Given that audio large language models (ALLMs) have made significant progress in various audio processing tasks, a natural question arises: *Can ALLMs be leveraged to solve ADD?* In this paper, we first conduct a comprehensive zero-shot evaluation of ALLMs on ADD, revealing their ineffectiveness. Motivated by this, we propose ALLM4ADD, an ALLM-driven framework for ADD. Specifically, we reformulate the ADD task as an audio question answering problem, prompting the model with the question: "Is this audio fake or real?". We then perform supervised fine-tuning to enable the ALLM to assess the authenticity of query audio. Extensive experiments demonstrate that our ALLM-based method achieves superior performance in fake audio detection, particularly in data-scarce scenarios. As a pioneering study, we anticipate that this work will inspire the research community to leverage ALLMs to develop more effective ADD systems. Code is available at https://github.com/ucas-hao/qwen_audio_for_add.git


Key findings
ALLM4ADD significantly outperforms existing pipeline-based and end-to-end models in audio deepfake detection. It remains effective even with limited training data, achieving an EER below 5% and accuracy above 96% with only around 200 training samples. Making the audio encoder trainable further improves performance.
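The EER (equal error rate) figure above is the operating point where the false acceptance rate (fake audio judged real) equals the false rejection rate (real audio judged fake). A minimal sketch of computing it from detector scores, assuming higher scores indicate "real" (the score convention here is an illustrative assumption, not taken from the paper):

```python
import numpy as np

def compute_eer(scores_real, scores_fake):
    """Return the EER given scores for real and fake clips.

    Assumes higher score = more likely real. Sweeps every observed
    score as a threshold and returns the rate where FAR ~= FRR.
    """
    scores = np.concatenate([scores_real, scores_fake])
    labels = np.concatenate([np.ones(len(scores_real)),   # 1 = real
                             np.zeros(len(scores_fake))]) # 0 = fake
    fars, frrs = [], []
    for t in np.sort(np.unique(scores)):
        pred_real = scores >= t
        fars.append(np.mean(pred_real[labels == 0]))   # fakes accepted
        frrs.append(np.mean(~pred_real[labels == 1]))  # reals rejected
    fars, frrs = np.array(fars), np.array(frrs)
    idx = np.argmin(np.abs(fars - frrs))  # closest FAR/FRR crossing
    return (fars[idx] + frrs[idx]) / 2
```

With perfectly separated scores the EER is 0; with completely interleaved scores it approaches 0.5 (chance level).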
Approach
ALLM4ADD reformulates audio deepfake detection as an audio question answering problem: the model is prompted with "Is this audio fake or real?" and fine-tuned with supervised learning on audio clips paired with the correct answer (Fake/Real). Low-Rank Adaptation (LoRA) is used to fine-tune the ALLM efficiently.
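The QA reformulation amounts to turning each labeled clip into a (question, answer) training pair. A minimal sketch of that data preparation step; the field names and file paths are illustrative assumptions, not the paper's actual schema:

```python
# Fixed question used to cast detection as audio question answering.
QUESTION = "Is this audio fake or real?"

def make_sft_sample(audio_path, is_fake):
    """Wrap one labeled clip as a QA-style supervised fine-tuning example."""
    return {
        "audio": audio_path,            # path to the query audio clip
        "prompt": QUESTION,             # same question for every sample
        "answer": "Fake" if is_fake else "Real",  # target text the ALLM learns to emit
    }

# Hypothetical labeled clips; real data would come from e.g. ASVspoof2019 LA.
dataset = [
    make_sft_sample("spoof_001.wav", True),
    make_sft_sample("bonafide_001.wav", False),
]
```

At fine-tuning time, each sample's audio and prompt form the model input and the one-word answer is the supervision target, so the frozen base model plus LoRA adapters only has to learn the real/fake decision, not a new output format.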
Datasets
ASVspoof2019 LA dataset, In-the-Wild dataset, EmoFake dataset, SceneFake dataset
Model(s)
Qwen-Audio-Chat, Qwen-Audio (base), Whisper (as the audio encoder), with LoRA for efficient fine-tuning.
Author countries
China