VoD: Learning Volume of Differences for Video-Based Deepfake Detection

Authors: Ying Xu, Marius Pedersen, Kiran Raja

Published: 2025-03-10 17:59:38+00:00

AI Summary

This paper presents VoD, a novel deepfake detection framework that leverages temporal and spatial inconsistencies between consecutive video frames to improve detection accuracy. VoD employs a progressive learning approach using consecutive frame differences (CFD) and a stepwise expansion network to capture differences across multiple axes.

Abstract

The rapid development of deep learning and generative AI technologies has profoundly transformed the digital content landscape, creating realistic Deepfakes that pose substantial challenges to public trust and digital media integrity. This paper introduces a novel Deepfake detection framework, Volume of Differences (VoD), designed to enhance detection accuracy by exploiting temporal and spatial inconsistencies between consecutive video frames. VoD employs a progressive learning approach that captures differences across multiple axes through the use of consecutive frame differences (CFD) and a network with stepwise expansions. We evaluate our approach with intra-dataset and cross-dataset testing scenarios on various well-known Deepfake datasets. Our findings demonstrate that VoD excels on the data it has been trained on and shows strong adaptability to novel, unseen data. Additionally, comprehensive ablation studies examine various configurations of segment length, sampling steps, and intervals, offering valuable insights for optimizing the framework. The code for our VoD framework is available at https://github.com/xuyingzhongguo/VoD.


Key findings
VoD outperforms state-of-the-art methods in intra-dataset testing on FaceForensics++. It also adapts well to unseen datasets, although cross-dataset performance varies. Ablation studies identify the segment length, sampling step, and interval settings that yield the best performance.
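To make the three ablation knobs concrete, here is a minimal, hypothetical sketch of how segment length, sampling step, and interval could govern which frame indices form each segment. The function name, parameter names, and windowing scheme are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch: how segment length, sampling step, and interval
# could select frame indices. Names and windowing are assumptions, not
# the authors' exact implementation.
def sample_frame_indices(num_frames, segment_length, step, interval):
    """Return one list of frame indices per sampled segment.

    segment_length: frames per segment (adjacent pairs later yield CFDs)
    step:           stride between consecutive frames within a segment
    interval:       offset between the start frames of successive segments
    """
    segments = []
    start = 0
    while start + (segment_length - 1) * step < num_frames:
        segments.append([start + i * step for i in range(segment_length)])
        start += interval
    return segments

# Example: a 32-frame clip, 8-frame segments, step 2, interval 16
# -> [[0, 2, ..., 14], [16, 18, ..., 30]]
print(sample_frame_indices(32, 8, 2, 16))
```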
Approach
VoD extracts consecutive frame differences (CFD) to highlight temporal and spatial inconsistencies. These CFD segments are then processed by a stepwise expansion network (X3D) to learn subtle differences along multiple axes (x, y, t) before classification.
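As a rough illustration of the CFD idea (not the exact VoD pipeline), the sketch below subtracts adjacent frames of a clip to form a difference volume; the (C, T, H, W) tensor layout and clip dimensions are assumptions.

```python
import torch

def consecutive_frame_differences(clip: torch.Tensor) -> torch.Tensor:
    """clip: float tensor of shape (C, T, H, W); returns (C, T-1, H, W),
    where each temporal slice is the difference between adjacent frames."""
    return clip[:, 1:] - clip[:, :-1]

clip = torch.rand(3, 16, 182, 182)       # 16 RGB frames, 182x182 crop
cfd_volume = consecutive_frame_differences(clip)
print(cfd_volume.shape)                   # torch.Size([3, 15, 182, 182])
```

In the paper's framing, a volume of this kind is what the stepwise expansion network consumes in place of raw frames.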
Datasets
FaceForensics++ (FF++), Celeb-DF (v2), DFDC, DeepFakeDetection (DFD)
Model(s)
Expandable 3D Network (X3D), specifically X3D-S; Slow r50 is also used in the ablation studies.
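Both backbones are available through PyTorchVideo's torch.hub entry points. The sketch below loads them and swaps the final projection for a two-class (real/fake) output; treat the head-replacement step as an assumption about the training setup rather than the repository's exact code.

```python
import torch
import torch.nn as nn

# Load the backbones named above from PyTorchVideo's model zoo.
x3d_s = torch.hub.load("facebookresearch/pytorchvideo", "x3d_s", pretrained=True)
slow_r50 = torch.hub.load("facebookresearch/pytorchvideo", "slow_r50", pretrained=True)

# Assumption: adapt each classification head to binary real/fake output.
# In PyTorchVideo, the last block's `proj` is the final linear projection.
x3d_s.blocks[-1].proj = nn.Linear(x3d_s.blocks[-1].proj.in_features, 2)
slow_r50.blocks[-1].proj = nn.Linear(slow_r50.blocks[-1].proj.in_features, 2)
```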
Author countries
China, Norway