MMNet: Multi-Collaboration and Multi-Supervision Network for Sequential Deepfake Detection

Authors: Ruiyang Xia, Decheng Liu, Jie Li, Lin Yuan, Nannan Wang, Xinbo Gao

Published: 2023-07-06 02:32:08+00:00

AI Summary

MMNet, a novel network for sequential deepfake detection, addresses the challenge of handling varied spatial and sequential manipulations in forged face images. It recovers the original image without knowledge of the manipulation method and introduces a new evaluation metric, Complete Sequence Matching (CSM), which assesses detection accuracy across multiple inference steps.

Abstract

Advanced manipulation techniques have provided criminals with opportunities to incite social panic or gain illicit profits through the generation of deceptive media, such as forged face images. In response, various deepfake detection methods have been proposed to assess image authenticity. Sequential deepfake detection, which is an extension of deepfake detection, aims to identify forged facial regions with the correct sequence for recovery. Nonetheless, due to the different combinations of spatial and sequential manipulations, forged face images exhibit substantial discrepancies that severely impact detection performance. Additionally, the recovery of forged images requires knowledge of the manipulation model to implement inverse transformations, which is difficult to ascertain as relevant techniques are often concealed by attackers. To address these issues, we propose the Multi-Collaboration and Multi-Supervision Network (MMNet), which handles various spatial scales and sequential permutations in forged face images and achieves recovery without requiring knowledge of the corresponding manipulation method. Furthermore, existing evaluation metrics only consider detection accuracy at a single inference step, without accounting for the degree of matching with the ground truth over multiple consecutive steps. To overcome this limitation, we propose a novel evaluation metric called Complete Sequence Matching (CSM), which considers the detection accuracy at multiple inference steps, reflecting the ability to detect complete forged sequences. Extensive experiments on several typical datasets demonstrate that MMNet achieves state-of-the-art detection performance and independent recovery performance.


Key findings
MMNet achieves state-of-the-art detection performance and independent recovery performance on the datasets used. The proposed CSM metric provides a more comprehensive evaluation of sequential deepfake detection compared to existing metrics. The model also shows high accuracy in binary deepfake detection.
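To illustrate why a sequence-level metric can be stricter than a per-step one, the sketch below contrasts the two on toy label sequences. It is not the paper's CSM formula, which this summary does not give. It assumes that each sample is a fixed-length list of integer manipulation labels and that a "complete sequence" score averages, over steps k, the fraction of samples whose first k predictions all match the ground truth. The function names are hypothetical.

```python
from typing import List, Sequence

Seq = Sequence[int]


def per_step_accuracy(preds: Sequence[Seq], targets: Sequence[Seq]) -> float:
    """Score every step independently: fraction of positions predicted correctly."""
    correct = 0
    total = 0
    for p, t in zip(preds, targets):
        for a, b in zip(p, t):
            correct += int(a == b)
            total += 1
    return correct / total


def prefix_match_rates(preds: Sequence[Seq], targets: Sequence[Seq]) -> List[float]:
    """For each step k, fraction of samples whose first k predictions all match."""
    n_steps = len(targets[0])
    rates = []
    for k in range(1, n_steps + 1):
        hits = sum(1 for p, t in zip(preds, targets) if list(p[:k]) == list(t[:k]))
        rates.append(hits / len(targets))
    return rates


def complete_sequence_score(preds: Sequence[Seq], targets: Sequence[Seq]) -> float:
    """Average the prefix match rates over all steps (an assumed aggregation)."""
    rates = prefix_match_rates(preds, targets)
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Toy labels: 0 stands for "no manipulation" padding (an assumption).
    targets = [[1, 2, 3], [2, 1, 0]]
    preds = [[1, 2, 3], [3, 1, 0]]  # second sample is wrong only at step 1
    print(per_step_accuracy(preds, targets))
    print(complete_sequence_score(preds, targets))
```

In this toy case a single early mistake barely lowers the per-step accuracy, but it invalidates every prefix of that sample, so the sequence-level score drops much further.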
Approach
MMNet uses a multi-collaboration module with multiple detection branches to handle different spatial scales and a multi-supervision module with pixel-level supervision to improve localization. A sequential inference process is modeled using masked attention, and a novel CSM metric evaluates the complete predicted sequence.
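The masked attention used for sequential inference can be sketched as follows. This is a minimal single-head version and not MMNet's actual implementation: it assumes the mask is causal, as in a standard Transformer decoder, so that step i attends only to steps up to and including i. It omits learned projections, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def masked_attention(q: Matrix, k: Matrix, v: Matrix) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product attention with a causal mask (assumed mask shape).

    Returns (output, weights). Row i of weights is zero for every j > i,
    so a step cannot look at later steps in the sequence.
    """
    d = len(q[0])
    outputs: Matrix = []
    weights: Matrix = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if j > i:
                scores.append(float("-inf"))  # mask out future steps
            else:
                scores.append(sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d))
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of value vectors for this step.
        outputs.append([sum(w[j] * v[j][c] for j in range(len(v)))
                        for c in range(len(v[0]))])
    return outputs, weights


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three steps, dimension 2
    out, w = masked_attention(x, x, x)
    for row in w:
        print([round(p, 3) for p in row])
```

The first step can only attend to itself, so its output equals its own value vector; later steps mix information from all earlier steps.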
Datasets
Facial-components and facial-attributes manipulation datasets (35,166 and 49,920 images, respectively); FFHQ dataset (for SRM pre-training).
Model(s)
ResNet50 as backbone, Transformer-based architecture with self-attention and masked attention mechanisms, PixelStylePixel (SRM).
Author countries
China