FFAA: Multimodal Large Language Model based Explainable Open-World Face Forgery Analysis Assistant

Authors: Zhengchao Huang, Bin Xia, Zicheng Lin, Zhun Mou, Wenming Yang, Jiaya Jia

Published: 2024-08-19 15:15:20+00:00

AI Summary

This paper introduces a novel Open-World Face Forgery Analysis VQA (OW-FFA-VQA) task and benchmark, addressing the lack of descriptive annotations and explainable outputs in existing face forgery analysis methods. It proposes FFAA, a multimodal large language model-based system that combines a fine-tuned MLLM with a Multi-answer Intelligent Decision System (MIDS) to improve accuracy and robustness while providing user-friendly, explainable results.

Abstract

The rapid advancement of deepfake technologies has sparked widespread public concern, particularly as face forgery poses a serious threat to public information security. However, the unknown and diverse forgery techniques, varied facial features, and complex environmental factors pose significant challenges for face forgery analysis. Existing datasets lack descriptive annotations of these aspects, making it difficult for models to distinguish between real and forged faces using only visual information amid various confounding factors. In addition, existing methods fail to yield user-friendly and explainable results, hindering the understanding of the model's decision-making process. To address these challenges, we introduce a novel Open-World Face Forgery Analysis VQA (OW-FFA-VQA) task and its corresponding benchmark. To tackle this task, we first establish a dataset featuring a diverse collection of real and forged face images with essential descriptions and reliable forgery reasoning. Based on this dataset, we introduce FFAA: Face Forgery Analysis Assistant, consisting of a fine-tuned Multimodal Large Language Model (MLLM) and a Multi-answer Intelligent Decision System (MIDS). By integrating hypothetical prompts with MIDS, the impact of fuzzy classification boundaries is effectively mitigated, enhancing model robustness. Extensive experiments demonstrate that our method not only provides user-friendly and explainable results but also significantly boosts accuracy and robustness compared to previous methods.


Key findings
FFAA significantly outperforms state-of-the-art methods in accuracy and robustness on the OW-FFA-Bench. The use of hypothetical prompts and MIDS effectively mitigates the impact of fuzzy classification boundaries. The method provides user-friendly and explainable results, enhancing its practical applicability.
Approach
FFAA tackles face forgery detection by framing it as a visual question answering (VQA) task. It uses a fine-tuned multimodal large language model (MLLM) to generate answers under different hypotheses (real or fake), and a Multi-answer Intelligent Decision System (MIDS) selects the best answer, improving robustness.
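The snippet below is a minimal sketch of how this hypothesis-driven inference could be wired together: the MLLM is queried once per hypothetical prompt, and a MIDS-like scorer keeps the answer most consistent with the image. The prompt wording and the `query_mllm`/`mids_score` interfaces are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative FFAA-style inference loop (assumptions: prompt wording and the
# query_mllm / mids_score interfaces are hypothetical stand-ins, not the paper's code).
from typing import List, Tuple

HYPOTHETICAL_PROMPTS = [
    "Assume this face is real. Describe the evidence and give a verdict.",
    "Assume this face is forged. Describe the evidence and give a verdict.",
    "Without any assumption, decide whether this face is real or forged.",
]

def query_mllm(image_path: str, prompt: str) -> str:
    """Stand-in for the fine-tuned MLLM; returns a reasoning-plus-verdict answer."""
    return f"[answer to '{prompt}' for {image_path}]"

def mids_score(image_path: str, answer: str) -> float:
    """Stand-in for MIDS; scores how consistent an answer is with the image."""
    return 0.0  # a real scorer would fuse image and answer features

def analyze_face(image_path: str) -> Tuple[str, List[str]]:
    # 1) Generate one explainable answer per hypothetical prompt.
    answers = [query_mllm(image_path, p) for p in HYPOTHETICAL_PROMPTS]
    # 2) Let the decision module pick the answer it judges most consistent with
    #    the image, which is what mitigates errors near the fuzzy real/fake boundary.
    scores = [mids_score(image_path, a) for a in answers]
    best = max(zip(scores, answers), key=lambda pair: pair[0])[1]
    return best, answers
```

With real model backends plugged into the two stand-in functions, `analyze_face("face.jpg")` would return the selected verdict along with all candidate analyses, preserving the explainable reasoning for the user.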
Datasets
OW-FFA-Bench (a benchmark composed of seven public datasets); FFA-VQA (a new dataset created via GPT-4-assisted data generation); Multi-attack (MA) dataset (created from FF++, Celeb-DF-v2, DFFD, and GanDiffFace)
Model(s)
Fine-tuned Multimodal Large Language Model (MLLM, specifically LLaVA-v1.6-mistral-7B); Multi-answer Intelligent Decision System (MIDS) built on CLIP-ViT-L/14 and a T5 encoder
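Below is a rough sketch of a MIDS-style decision head in PyTorch: a CLIP vision encoder embeds the face image, a T5 encoder embeds a candidate answer, and a small classifier judges whether the answer is consistent with the image. The fusion and classification layers are assumptions for illustration; the paper's exact MIDS architecture may differ.

```python
# Sketch of a MIDS-like answer-consistency head (fusion/classifier layers are assumptions).
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, T5EncoderModel

class MIDSSketch(nn.Module):
    def __init__(self,
                 clip_name: str = "openai/clip-vit-large-patch14",
                 t5_name: str = "t5-base"):
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained(clip_name)   # image encoder
        self.text = T5EncoderModel.from_pretrained(t5_name)        # answer encoder
        fused_dim = self.vision.config.hidden_size + self.text.config.d_model
        # Hypothetical fusion head: concatenate pooled features, classify consistency.
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 512), nn.GELU(), nn.Linear(512, 2))

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.vision(pixel_values=pixel_values).pooler_output          # (B, d_img)
        txt = self.text(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state
        txt = txt.mean(dim=1)                                               # (B, d_txt)
        return self.classifier(torch.cat([img, txt], dim=-1))               # consistency logits
```

In use, each candidate answer produced under a different hypothesis would be scored by this head, and the highest-scoring answer would be returned as the final verdict.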
Author countries
China, Hong Kong