MVFNet: Multipurpose Video Forensics Network using Multiple Forms of Forensic Evidence

Authors: Tai D. Nguyen, Matthew C. Stamm

Published: 2025-03-26 21:11:06+00:00

AI Summary

MVFNet is a multipurpose video forensics network that detects multiple manipulation types (deepfakes, inpainting, splicing, editing) by jointly analyzing various forensic feature modalities (spatial and temporal residuals, optical flow residuals). It uses a novel Multi-Scale Hierarchical Transformer to identify inconsistencies across spatial scales, achieving state-of-the-art performance in general scenarios and rivaling specialized detectors in targeted scenarios.

Abstract

While videos can be falsified in many different ways, most existing forensic networks are specialized to detect only a single manipulation type (e.g., deepfake, inpainting). This poses a significant issue as the manipulation used to falsify a video is not known a priori. To address this problem, we propose MVFNet, a multipurpose video forensics network capable of detecting multiple types of manipulations including inpainting, deepfakes, splicing, and editing. Our network does this by extracting and jointly analyzing a broad set of forensic feature modalities that capture both spatial and temporal anomalies in falsified videos. To reliably detect and localize fake content of all shapes and sizes, our network employs a novel Multi-Scale Hierarchical Transformer module to identify forensic inconsistencies across multiple spatial scales. Experimental results show that our network obtains state-of-the-art performance in general scenarios where multiple different manipulations are possible, and rivals specialized detectors in targeted scenarios.
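
To make "multiple spatial scales" concrete, here is a minimal sketch that views one feature map at several resolutions. It uses plain average pooling as a stand-in. The actual Multi-Scale Hierarchical Transformer is a learned module whose internals are not described in this summary. The names `average_pool` and `multi_scale_views`, and the scale set `(1, 2, 4)`, are assumptions made for illustration.

```python
def average_pool(feature_map, size):
    """Non-overlapping size x size average pooling over a 2D list.

    Trailing rows or columns that do not fill a whole window are dropped.
    """
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    area = float(size * size)
    return [
        [
            sum(
                feature_map[y * size + dy][x * size + dx]
                for dy in range(size)
                for dx in range(size)
            ) / area
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def multi_scale_views(feature_map, scales=(1, 2, 4)):
    """Return a dict mapping each scale to the pooled feature map.

    Illustrative only: a small forged region stands out at fine scales,
    while a large one is easier to see at coarse scales.
    """
    return {scale: average_pool(feature_map, scale) for scale in scales}


# Hypothetical 4x4 feature map with values 0..15.
example_map = [[4 * row + col for col in range(4)] for row in range(4)]
views = multi_scale_views(example_map)
```

In this sketch each scale is computed independently. A hierarchical design would additionally pass information between scales, which is not shown here.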


Key findings

MVFNet achieves state-of-the-art performance in detecting multiple video manipulations. It rivals specialized detectors in single-manipulation scenarios and generalizes well to out-of-distribution data and unseen manipulation techniques. The ablation study confirms the importance of all network components, particularly the multi-scale hierarchical transformer.

Approach

MVFNet extracts multiple forensic feature modalities from videos, including novel temporal forensic residuals and optical flow residuals. These features, along with spatial residuals and RGB context, are jointly analyzed using a multi-scale hierarchical transformer module to detect and localize fake content regardless of size or shape.
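
As a rough illustration of what a forensic "residual" is, the sketch below computes two simple, fixed residuals on grayscale frames stored as nested lists: a frame difference (temporal) and a pixel-minus-neighborhood-mean (spatial). These are stand-ins chosen for clarity, not the paper's extractors, which are described as novel and are not specified in this summary. The function names are hypothetical.

```python
def temporal_residual(prev_frame, next_frame):
    """Per-pixel difference between two consecutive grayscale frames."""
    return [
        [b - a for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]


def spatial_residual(frame):
    """Each interior pixel minus the mean of its four direct neighbors.

    Border pixels are left at 0.0 because they lack a full neighborhood.
    Smooth scene content cancels out, leaving fine-grained deviations.
    """
    height, width = len(frame), len(frame[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            neighbors = (
                frame[y - 1][x] + frame[y + 1][x]
                + frame[y][x - 1] + frame[y][x + 1]
            )
            out[y][x] = frame[y][x] - neighbors / 4.0
    return out


# Hypothetical toy inputs.
frame_a = [[1, 2], [3, 4]]
frame_b = [[2, 2], [3, 6]]
motion = temporal_residual(frame_a, frame_b)

spike = [[0, 0, 0], [0, 4, 0], [0, 0, 0]]
detail = spatial_residual(spike)
```

A uniform frame yields an all-zero spatial residual, which is the point of the operation: it discards scene content and keeps the low-level traces that a forensic network can then analyze.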

Datasets

Unified Video Forgery Analysis (UVFA) dataset (combining the VideoFACT, FaceForensics++, DeepFakeDetection, and DEVIL datasets); DAVIS dataset; VideoSham dataset

Model(s)

Multipurpose Video Forensics Network (MVFNet) with a Multi-Scale Hierarchical Transformer module, constrained convolutional layers, and fused inverted residual (FIR) blocks.
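
In the form of constrained convolution commonly described in the image forensics literature, each filter's center weight is fixed to -1 and the remaining weights are normalized to sum to 1. The filter then predicts a pixel from its neighbors and outputs the prediction error. The sketch below projects an arbitrary square kernel onto that constraint. It assumes MVFNet uses this form, which the summary above does not state. `project_constrained_kernel` is a hypothetical helper name.

```python
def project_constrained_kernel(kernel):
    """Project a square, odd-sized kernel onto the prediction-error constraint.

    After projection the center weight is -1 and all other weights sum
    to 1, so the kernel sums to 0 and cancels locally constant content.
    """
    size = len(kernel)
    center = size // 2
    off_center_sum = sum(
        kernel[i][j]
        for i in range(size)
        for j in range(size)
        if (i, j) != (center, center)
    )
    if off_center_sum == 0:
        raise ValueError("off-center weights sum to zero; cannot normalize")
    projected = [[value / off_center_sum for value in row] for row in kernel]
    projected[center][center] = -1.0
    return projected


# Hypothetical starting weights; in training this projection would
# typically be re-applied after each weight update.
raw_kernel = [[1, 1, 1], [1, 5, 1], [1, 1, 1]]
constrained = project_constrained_kernel(raw_kernel)
```

The original center value is discarded by design: only the off-center weights are learned, and the constraint forces the layer to respond to deviations from the local prediction instead of to image content.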

Author countries

USA