Interpretable All-Type Audio Deepfake Detection with Audio LLMs via Frequency-Time Reinforcement Learning

Authors: Yuankun Xie, Xiaoxuan Guo, Jiayi Zhou, Tao Wang, Jian Liu, Ruibo Fu, Xiaopeng Wang, Haonan Cheng, Long Ye

Published: 2026-01-06 12:50:02+00:00

AI Summary

This paper addresses the need for all-type audio deepfake detection (ADD) that generalizes across heterogeneous audio and provides interpretable decisions. The authors propose an automatic annotation pipeline to construct Frequency-Time (FT) structured Chain-of-Thought (CoT) rationales, generating ~340K cold-start demonstrations. Building on this data, they introduce Frequency Time-Group Relative Policy Optimization (FT-GRPO), a two-stage training paradigm combining SFT cold-start with GRPO under rule-based frequency-time constraints, achieving state-of-the-art performance and interpretable rationales.

Abstract

Recent advances in audio large language models (ALLMs) have made high-quality synthetic audio widely accessible, increasing the risk of malicious audio deepfakes across speech, environmental sounds, singing voice, and music. Real-world audio deepfake detection (ADD) therefore requires all-type detectors that generalize across heterogeneous audio and provide interpretable decisions. Given the strong multi-task generalization ability of ALLMs, we first investigate their performance on all-type ADD under both supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). However, SFT using only binary real/fake labels tends to reduce the model to a black-box classifier, sacrificing interpretability. Meanwhile, vanilla RFT under sparse supervision is prone to reward hacking and can produce hallucinated, ungrounded rationales. To address this, we propose an automatic annotation and polishing pipeline that constructs Frequency-Time structured chain-of-thought (CoT) rationales, producing ~340K cold-start demonstrations. Building on CoT data, we propose Frequency Time-Group Relative Policy Optimization (FT-GRPO), a two-stage training paradigm that cold-starts ALLMs with SFT and then applies GRPO under rule-based frequency-time constraints. Experiments demonstrate that FT-GRPO achieves state-of-the-art performance on all-type ADD while producing interpretable, FT-grounded rationales. The data and code are available online.

Key findings

FT-GRPO achieves state-of-the-art performance on all-type ADD tasks, with a 3B model trained on speech data reaching 99.75% accuracy on ASVspoof2019LA. When co-trained on all audio types, FT-GRPO achieves a 90.10% average accuracy across all test sets, while also producing interpretable, Frequency-Time grounded rationales that improve explainability.

Approach

The authors develop an automatic annotation and polishing pipeline to generate Frequency-Time structured Chain-of-Thought rationales for audio deepfake detection. This data is then used in FT-GRPO, a two-stage training paradigm where an Audio LLM is first cold-started with Supervised Fine-Tuning (SFT) to learn the FT reasoning schema. Subsequently, Group Relative Policy Optimization (GRPO) is applied under rule-based frequency-time constraints with a composite reward function, also leveraging samples without reliable annotations.

Datasets

ASVspoof2019LA (Speech-19LA), ESDD (Sound), CtrSVDD (Singing Voice), FakeMusicCaps (Music)

Model(s)

Qwen2-Audio-Chat-7B, Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, Qwen3-Omni-Captioner, Qwen3-235B, LoRA (Low-Rank Adaptation)

Author countries

China

← Previous