Fake-in-Facext: Towards Fine-Grained Explainable DeepFake Analysis

Authors: Lixiong Qin, Yang Zhang, Mei Wang, Jiani Hu, Weihong Deng, Weiran Xu

Published: 2025-10-23 13:16:12+00:00

AI Summary

The Fake-in-Facext (FiFa) framework is proposed to enhance fine-grained awareness in Explainable DeepFake Analysis (XDFA) by grounding MLLM responses in Face Visual Context (Facext). It introduces the Artifact-Grounding Explanation (AGE) task, which requires generating textual forgery explanations interleaved with corresponding artifact segmentation masks. To support this, the authors developed the FiFa-Annotator pipeline, using a Facial Image Concept Tree (FICT) to construct the large-scale FiFa-Instruct-1M training dataset.
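The summary does not spell out how the Facial Image Concept Tree is represented; below is a minimal Python sketch of how such a concept tree could be structured, where the ConceptNode class and the node names (eyes, nose, mouth, skin, etc.) are illustrative assumptions rather than the paper's actual taxonomy.

```python
# Hypothetical sketch of a Facial Image Concept Tree (FICT).
# Node names below are illustrative assumptions, not the paper's taxonomy.

from dataclasses import dataclass, field
from typing import List


@dataclass
class ConceptNode:
    """A node in the concept tree: a facial region or finer sub-concept."""
    name: str
    children: List["ConceptNode"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Enumerate the fine-grained leaf concepts under this node."""
        if not self.children:
            return [self.name]
        out: List[str] = []
        for child in self.children:
            out.extend(child.leaves())
        return out


# Illustrative tree; a real FICT would be larger and paper-defined.
face = ConceptNode("face", [
    ConceptNode("eyes", [ConceptNode("left_eye"), ConceptNode("right_eye")]),
    ConceptNode("nose"),
    ConceptNode("mouth", [ConceptNode("lips"), ConceptNode("teeth")]),
    ConceptNode("skin", [ConceptNode("forehead"), ConceptNode("cheeks")]),
])

print(face.leaves())
# ['left_eye', 'right_eye', 'nose', 'lips', 'teeth', 'forehead', 'cheeks']
```

A tree like this gives an annotation pipeline a fixed vocabulary of regional concepts, so that forgery descriptions can be tied to specific facial regions instead of the whole image.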

Abstract

The advancement of Multimodal Large Language Models (MLLMs) has bridged the gap between vision and language tasks, enabling the implementation of Explainable DeepFake Analysis (XDFA). However, current methods suffer from a lack of fine-grained awareness: the description of artifacts in data annotation is unreliable and coarse-grained, and the models fail to support the output of connections between textual forgery explanations and the visual evidence of artifacts, as well as the input of queries for arbitrary facial regions. As a result, their responses are not sufficiently grounded in Face Visual Context (Facext). To address this limitation, we propose the Fake-in-Facext (FiFa) framework, with contributions focusing on data annotation and model construction. We first define a Facial Image Concept Tree (FICT) to divide facial images into fine-grained regional concepts, thereby obtaining a more reliable data annotation pipeline, FiFa-Annotator, for forgery explanation. Based on this dedicated data annotation, we introduce a novel Artifact-Grounding Explanation (AGE) task, which generates textual forgery explanations interleaved with segmentation masks of manipulated artifacts. We propose a unified multi-task learning architecture, FiFa-MLLM, to simultaneously support abundant multimodal inputs and outputs for fine-grained Explainable DeepFake Analysis. With multiple auxiliary supervision tasks, FiFa-MLLM can outperform strong baselines on the AGE task and achieve SOTA performance on existing XDFA datasets. The code and data will be made open-source at https://github.com/lxq1000/Fake-in-Facext.
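To make the AGE output format concrete, here is a hedged parsing sketch that assumes a GLaMM/LISA-style placeholder token ([SEG]) marks where each artifact mask grounds the preceding phrase; the token name, the pairing rule, and the pair_explanation_with_masks helper are assumptions for illustration, not the paper's exact protocol.

```python
# Minimal sketch of pairing an interleaved Artifact-Grounding Explanation
# (AGE) with its segmentation masks. Assumption: the model emits a [SEG]
# placeholder after each grounded phrase, in the style of LISA/GLaMM.

import numpy as np

SEG_TOKEN = "[SEG]"


def pair_explanation_with_masks(text: str, masks: list) -> tuple:
    """Split an interleaved explanation and pair each [SEG] with a mask."""
    segments = text.split(SEG_TOKEN)
    # n [SEG] tokens separate n+1 text spans; each token consumes one mask.
    assert len(segments) - 1 == len(masks), "one mask per [SEG] token"
    paired = [(span.strip(), mask) for span, mask in zip(segments[:-1], masks)]
    return paired, segments[-1].strip()


# Toy usage with 2x2 dummy arrays standing in for real segmentation masks.
explanation = (
    "The blending boundary along the jawline [SEG] is inconsistent, "
    "and the texture of the left eye [SEG] looks over-smoothed."
)
dummy_masks = [np.zeros((2, 2)), np.ones((2, 2))]
paired, tail = pair_explanation_with_masks(explanation, dummy_masks)
for span, mask in paired:
    print(f"{span!r} -> mask {mask.shape}")
```

Whatever the exact token convention, the point of AGE is that each claim about an artifact in the text is tied to pixel-level evidence, rather than the explanation and the mask being produced independently.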


Key findings
FiFa-MLLM significantly outperforms the strong baseline GLaMM across almost all of the FiFa-11 tasks, demonstrating the effectiveness of the fine-grained approach. The framework achieves state-of-the-art (SOTA) results on existing XDFA benchmarks such as DD-VQA and DFA-Bench. Furthermore, training data produced by the FiFa-Annotator pipeline yields better DeepFake detection performance than data from existing automated annotation methods.
Approach
The proposed FiFa-MLLM is a unified multi-task learning architecture built on MLLMs and designed to handle the 11 fine-grained XDFA tasks (FiFa-11), including the novel AGE task. It pairs a single global visual encoder (FaRL-ViT-B) with a Multi-Task Decoder that simultaneously performs explanation generation and pixel grounding (mask prediction), with auxiliary Region Mask Prediction supervision; a schematic sketch follows.
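As a rough picture of the multi-task wiring described above, here is a schematic PyTorch sketch. Every layer is a stand-in (plain Linear modules replace the FaRL ViT-B encoder, the Vicuna-7B LLM, and the mask decoder) and all dimensions are assumed, so this illustrates the data flow, not the actual implementation.

```python
# Schematic sketch of a FiFa-MLLM-style multi-task forward pass.
# All modules and sizes below are stand-in assumptions for illustration.

import torch
import torch.nn as nn


class FiFaSketch(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=4096, vocab=32000, mask_hw=64):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 224 * 224, vis_dim)    # stand-in for FaRL ViT-B
        self.projector = nn.Linear(vis_dim, llm_dim)               # vision -> LLM token space
        self.llm = nn.Linear(llm_dim, llm_dim)                     # stand-in for Vicuna-7B
        self.text_head = nn.Linear(llm_dim, vocab)                 # explanation generation
        self.mask_decoder = nn.Linear(llm_dim, mask_hw * mask_hw)  # pixel grounding head
        self.mask_hw = mask_hw

    def forward(self, image: torch.Tensor):
        feats = self.vision_encoder(image.flatten(1))
        hidden = self.llm(self.projector(feats))
        text_logits = self.text_head(hidden)   # drives the textual explanation
        mask = self.mask_decoder(hidden)       # drives artifact mask prediction
        return text_logits, mask.view(-1, self.mask_hw, self.mask_hw)


model = FiFaSketch()
logits, masks = model(torch.randn(1, 3, 224, 224))
print(logits.shape, masks.shape)  # torch.Size([1, 32000]) torch.Size([1, 64, 64])
```

The design choice worth noting is that one shared backbone feeds both the text head and the mask head, which is what lets the model interleave explanations with grounded masks and benefit from auxiliary supervision such as Region Mask Prediction.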
Datasets
FiFa-Instruct-1M, FiFa-Bench, DFFD (Diverse Fake Face Dataset), DD-VQA, DFA-Bench.
Model(s)
FiFa-MLLM (unified multi-task learning architecture); FaRL-pretrained ViT-B face encoder; Vicuna-7B Large Language Model (LLM).
Author countries
China