MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution
Authors: Siran Peng, Zipei Wang, Li Gao, Xiangyu Zhu, Tianshuo Zhang, Ajian Liu, Haoyuan Zhang, Zhen Lei
Published: 2025-05-04 06:58:21+00:00
AI Summary
This paper introduces VLF-FFD, a vision-language fusion framework for face forgery detection. It couples an external detector with a Multimodal Large Language Model (MLLM) through a novel Vision-Language Fusion Network (VLF-Net), trained on EFF++, a new explainability-driven extension of FaceForensics++. VLF-FFD achieves state-of-the-art performance in both cross-dataset and intra-dataset evaluations.
Abstract
Reliable face forgery detection algorithms are crucial for countering the growing threat of deepfake-driven disinformation. Previous research has demonstrated the potential of Multimodal Large Language Models (MLLMs) in identifying manipulated faces. However, existing methods typically depend on either the Large Language Model (LLM) alone or an external detector to generate classification results, which often leads to sub-optimal integration of visual and textual modalities. In this paper, we propose VLF-FFD, a novel Vision-Language Fusion solution for MLLM-enhanced Face Forgery Detection. Our key contributions are twofold. First, we present EFF++, a frame-level, explainability-driven extension of the widely used FaceForensics++ (FF++) dataset. In EFF++, each manipulated video frame is paired with a textual annotation that describes both the forgery artifacts and the specific manipulation technique applied, enabling more effective and informative MLLM training. Second, we design a Vision-Language Fusion Network (VLF-Net) that promotes bidirectional interaction between visual and textual features, supported by a three-stage training pipeline to fully leverage its potential. VLF-FFD achieves state-of-the-art (SOTA) performance in both cross-dataset and intra-dataset evaluations, underscoring its exceptional effectiveness in face forgery detection.
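The abstract states that each manipulated frame in EFF++ is paired with text describing both the forgery artifacts and the manipulation technique, but does not give the annotation schema. A minimal sketch of what such a frame-level record might look like, with hypothetical field names (the method values follow the FF++ manipulation families):

```python
from dataclasses import dataclass


@dataclass
class EFFPlusPlusRecord:
    """Hypothetical frame-level annotation record for EFF++.

    Field names are illustrative assumptions; the abstract only states
    that each manipulated frame is paired with a textual annotation
    covering the artifacts and the manipulation technique applied.
    """
    video_id: str               # source FF++ video identifier
    frame_index: int            # index of the annotated frame
    is_fake: bool               # binary forgery label
    method: str                 # e.g., "Deepfakes", "Face2Face", "FaceSwap", "NeuralTextures"
    artifact_description: str   # textual explanation of visible forgery artifacts


# Example record (content invented for illustration):
record = EFFPlusPlusRecord(
    video_id="000_003",
    frame_index=42,
    is_fake=True,
    method="Face2Face",
    artifact_description=(
        "Blending inconsistencies along the jawline and "
        "unnatural texture on the cheeks."
    ),
)
```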
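The internals of VLF-Net are likewise not specified in the abstract. A minimal sketch of one plausible bidirectional fusion block, using symmetric cross-attention under the assumption that the visual and textual features are token sequences of a shared width; this is not the paper's design, only an illustration of the bidirectional interaction it describes:

```python
import torch
import torch.nn as nn


class BidirectionalFusionBlock(nn.Module):
    """Illustrative bidirectional vision-language fusion block.

    Each modality attends to the other via cross-attention, so visual
    features from the external detector and textual features from the
    MLLM can exchange information in both directions.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.text_to_vision = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vision_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, text: torch.Tensor):
        # visual: (B, Nv, dim) tokens from the external detector
        # text:   (B, Nt, dim) tokens from the MLLM
        v_attn, _ = self.vision_to_text(query=visual, key=text, value=text)
        t_attn, _ = self.text_to_vision(query=text, key=visual, value=visual)
        visual = self.norm_v(visual + v_attn)  # residual + norm
        text = self.norm_t(text + t_attn)
        return visual, text


# Usage with dummy features:
block = BidirectionalFusionBlock(dim=768)
v = torch.randn(2, 196, 768)  # e.g., 14x14 visual patch tokens
t = torch.randn(2, 32, 768)   # e.g., 32 text tokens
v_out, t_out = block(v, t)
```

Stacking such blocks and feeding the fused features to a classification head would be one natural realization; the paper's three-stage training pipeline presumably schedules how the detector, MLLM, and fusion components are optimized, but its details are not given here.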