Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection

Authors: Akanksha Chuchra, Shukesh Reddy, Sudeepta Mishra, Abhijit Das, Abhinav Dhall

Published: 2026-01-02 18:17:22+00:00

Comment: Accepted at IJCB 2025

AI Summary

This paper investigates the viability of using Multimodal Large Language Models (MLLMs) for audio deepfake detection by reformulating it as an Audio Question-Answering (AQA) task. Evaluating Qwen2-Audio-7B-Instruct and SALMONN in zero-shot and LoRA fine-tuned settings, the study finds that MLLMs perform poorly without task-specific training and struggle with out-of-domain generalization. However, they achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.

Abstract

While Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes, their use for audio deepfake detection remains largely unexplored. In this work, we explore the potential of MLLMs for audio deepfake detection by combining audio inputs with a range of text prompts as queries, assessing whether MLLMs can learn robust representations across modalities for audio deepfake detection. To this end, we explore text-aware, context-rich, question-answer based prompts with binary decisions. We hypothesise that such feature-guided reasoning facilitates deeper multimodal understanding and enables robust feature learning for audio deepfake detection. We evaluate the performance of two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in two evaluation modes: (a) zero-shot and (b) fine-tuned. Our experiments demonstrate that combining audio with a multi-prompt approach could be a viable way forward for audio deepfake detection: the models perform poorly without task-specific training and struggle to generalise to out-of-domain data, but achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.
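
A minimal zero-shot sketch of this audio question-answering framing, assuming the Hugging Face transformers interface for Qwen2-Audio-7B-Instruct; the prompt wording, audio path, and generation settings are illustrative placeholders rather than the paper's exact prompts, and processor keyword names may vary slightly across transformers versions:

# Zero-shot AQA-style query to Qwen2-Audio-7B-Instruct (illustrative sketch).
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Binary, question-answer style prompt paired with the audio clip.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "utterance.wav"},
        {"type": "text", "text": "Is this speech bonafide or spoofed? Answer with one word."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("utterance.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=[audio], return_tensors="pt", padding=True).to(model.device)
out = model.generate(**inputs, max_new_tokens=8)
answer = processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(answer)  # expected to contain "bonafide" or "spoof"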


Key findings
MLLMs demonstrated poor performance in zero-shot evaluation, often close to random chance, but showed significant accuracy improvements on in-domain data (ASVspoof 2019 LA) after LoRA fine-tuning with minimal supervision. However, their generalization to the out-of-domain In-the-Wild (ITW) dataset remained limited. The fine-tuned MLLMs, especially SALMONN, achieved performance comparable to or better than classical audio deepfake detection methods on in-domain data.
Approach
The authors reformulate audio deepfake detection as an Audio Question-Answering (AQA) task, in which MLLMs receive audio input alongside various text prompts (e.g., binary, context-rich) to elicit a 'bonafide' or 'spoof' response. They evaluate this approach in both zero-shot and fine-tuned modes, using Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning of the MLLMs.
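A sketch of how such LoRA adaptation could be wired up with the peft library on top of the Qwen2-Audio checkpoint; the rank, scaling, dropout, and target modules below are assumed values for illustration, not the hyperparameters reported in the paper:

# LoRA adaptation sketch (peft); hyperparameters are assumed, not the paper's.
from peft import LoraConfig, get_peft_model
from transformers import Qwen2AudioForConditionalGeneration

base = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", device_map="auto"
)

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=16,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in the language decoder
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# Supervised fine-tuning then proceeds on (audio, prompt) pairs whose target
# text is the single word "bonafide" or "spoof".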
Datasets
ASVspoof 2019 Logical Access (ASV19 LA), In-the-Wild (ITW) dataset
Model(s)
Qwen2-Audio-7B-Instruct, SALMONN-13B
Author countries
India, Australia