Towards Interactive Deepfake Analysis

Authors: Lixiong Qin, Ning Jiang, Yang Zhang, Yuhan Qiu, Dingheng Zeng, Jiani Hu, Weihong Deng

Published: 2025-01-02 09:34:11+00:00

AI Summary

This paper introduces interactive deepfake analysis via instruction tuning of multi-modal large language models (MLLMs). To address the lack of datasets and benchmarks, the authors create DFA-Instruct, a new instruction-following dataset, and DFA-Bench, a benchmark for evaluating MLLMs on deepfake analysis. They also present DFA-GPT, an interactive deepfake analysis system that serves as a strong baseline.

Abstract

Existing deepfake analysis methods are primarily based on discriminative models, which significantly limits their application scenarios. This paper explores interactive deepfake analysis by performing instruction tuning on multi-modal large language models (MLLMs). This effort faces challenges such as the lack of datasets and benchmarks and low training efficiency. To address these issues, we introduce (1) a GPT-assisted data construction process that yields an instruction-following dataset called DFA-Instruct, (2) a benchmark named DFA-Bench, designed to comprehensively evaluate the capabilities of MLLMs in deepfake detection, deepfake classification, and artifact description, and (3) an interactive deepfake analysis system called DFA-GPT, built with a Low-Rank Adaptation (LoRA) module, as a strong baseline for the community. The dataset and code will be made available at https://github.com/lxq1000/DFA-Instruct to facilitate further research.
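The abstract does not detail the GPT-assisted construction step. The sketch below shows one plausible shape of such a pipeline using the OpenAI chat API; the prompt wording, the `gpt-4o-mini` model choice, and the annotation fields (`label`, `method`, `artifact_notes`) are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch of a GPT-assisted QA-pair construction step.
# The prompt, model choice, and annotation fields are assumptions;
# the paper's real pipeline may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_qa_pairs(label: str, method: str, artifact_notes: str) -> str:
    """Ask GPT to turn raw per-image annotations into instruction-style QA pairs."""
    prompt = (
        "You are building a deepfake-analysis instruction dataset.\n"
        f"Image label: {label}\n"
        f"Forgery method: {method}\n"
        f"Artifact notes: {artifact_notes}\n"
        "Write three question-answer pairs covering deepfake detection, "
        "deepfake classification, and artifact description."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(build_qa_pairs("fake", "face-swap", "blending seam along the jawline"))
```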


Key findings
DFA-GPT outperforms vision-only models in deepfake detection and classification on DFA-Bench. The inclusion of artifact description annotations improves performance. Existing general-purpose MLLMs show limited deepfake analysis capabilities, highlighting the contribution of DFA-GPT.
Approach
The authors perform instruction tuning on multi-modal large language models (MLLMs). They build a new dataset, DFA-Instruct, through a GPT-assisted data construction process and use it to train their model, DFA-GPT. Low-Rank Adaptation (LoRA) is used for efficient training; a minimal sketch of such a setup follows.
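As a rough illustration of this training recipe, here is a LoRA sketch using Hugging Face `transformers` and `peft` on the LLaVA-1.5-7B checkpoint the paper starts from. The rank, scaling factor, dropout, and target modules are assumed values for demonstration, not the paper's configuration.

```python
# Minimal sketch of LoRA instruction tuning on LLaVA-1.5-7B, the
# initialization DFA-GPT starts from. All LoRA hyperparameters here
# are illustrative assumptions, not the paper's reported values.
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16
)

# Base weights stay frozen; only the low-rank adapters receive gradients.
lora_config = LoraConfig(
    r=16,                                  # assumed adapter rank
    lora_alpha=32,                         # assumed scaling factor
    target_modules=["q_proj", "v_proj"],   # adapt attention projections (assumed choice)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

Only the adapter weights are updated, which is what makes instruction tuning a 7B-parameter MLLM tractable on modest hardware.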
Datasets
DFA-Instruct (127.3K aligned face images and 891.6K question-answer pairs), DF-40, FF++ (FaceForensics++), Celeb-DF
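The paper's summary does not reproduce the DFA-Instruct schema. The record below is a hypothetical sample in the LLaVA-style conversation format that instruction-tuned MLLMs commonly consume; the field names follow LLaVA conventions and all text is invented for illustration.

```python
# Hypothetical DFA-Instruct record in the LLaVA-style conversation format.
# Field names ("id", "image", "conversations") follow LLaVA conventions;
# the actual DFA-Instruct schema may differ.
sample = {
    "id": "000001",
    "image": "faces/000001.jpg",
    "conversations": [
        {"from": "human",
         "value": "<image>\nIs this face real or manipulated, and why?"},
        {"from": "gpt",
         "value": "The face is manipulated. A blending seam is visible along "
                  "the jawline and the skin texture is unnaturally smooth, "
                  "which is consistent with a face-swap forgery."},
    ],
}
```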
Model(s)
LLaVA-1.5-7B (weight initialization), Vicuna (LLM decoder), CLIP-L/14 (vision encoder); ResNet101, DeiT-B/16, DeiT-L/14, CLIP-B/16, CLIP-L/14 (comparison baselines)
Author countries
China