Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection


Authors: Taehoon Kim, Jongwook Choi, Yonghyun Jeong, Haeun Noh, Jaejun Yoo, Seungryul Baek, Jongwon Choi

Published: 2025-07-03 07:49:55+00:00

AI Summary

This paper proposes a deepfake video detection method leveraging pixel-wise temporal inconsistencies often missed by traditional spatial frequency-based detectors. It achieves this by performing a 1D Fourier transform on each pixel's time axis and integrating these features with spatio-temporal context using a joint transformer module.

Abstract

We introduce a deepfake video detection approach that exploits pixel-wise temporal inconsistencies, which traditional spatial frequency-based detectors often overlook. Traditional detectors represent temporal information merely by stacking spatial frequency spectra across frames, resulting in the failure to detect temporal artifacts in the pixel plane. Our approach performs a 1D Fourier transform on the time axis for each pixel, extracting features highly sensitive to temporal inconsistencies, especially in areas prone to unnatural movements. To precisely locate regions containing the temporal artifacts, we introduce an attention proposal module trained in an end-to-end manner. Additionally, our joint transformer module effectively integrates pixel-wise temporal frequency features with spatio-temporal context features, expanding the range of detectable forgery artifacts. Our framework represents a significant advancement in deepfake video detection, providing robust performance across diverse and challenging detection scenarios.

Key findings

The proposed method outperforms state-of-the-art methods on various unseen datasets, demonstrating strong generalization ability. The method shows robustness against common perturbations like blur and resizing. Ablation studies highlight the importance of both global and part-based temporal frequency features and the effectiveness of the attention proposal module.

Approach

The approach extracts pixel-wise temporal frequency features using a 1D Fourier transform on each pixel's time series. An attention proposal module identifies regions of interest containing temporal artifacts. A joint transformer module integrates these features with spatio-temporal context features for final classification.
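The first step above, a 1D Fourier transform along each pixel's time axis, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `pixelwise_temporal_spectrum`, the grayscale `(T, H, W)` input layout, the per-pixel mean subtraction, and the use of the magnitude spectrum are all assumptions made for illustration. The attention proposal module and the joint transformer are not shown.

```python
import numpy as np


def pixelwise_temporal_spectrum(frames: np.ndarray) -> np.ndarray:
    """Magnitude spectrum of each pixel's intensity over time.

    frames: array of shape (T, H, W), one grayscale frame per time step
            (layout assumed for this sketch).
    returns: array of shape (T // 2 + 1, H, W); axis 0 indexes temporal
             frequency bins.
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.ndim != 3:
        raise ValueError("expected frames with shape (T, H, W)")

    # Assumption: subtract each pixel's temporal mean so the DC bin does not
    # dominate the spectrum. The paper's actual preprocessing may differ.
    centered = frames - frames.mean(axis=0, keepdims=True)

    # 1D real FFT along the time axis, applied independently to every pixel.
    spectrum = np.fft.rfft(centered, axis=0)
    return np.abs(spectrum)


# Synthetic clip: every pixel is static except one, which flickers
# at 5 cycles over the 32-frame clip.
T, H, W = 32, 4, 4
t = np.arange(T)
clip = np.zeros((T, H, W))
clip[:, 1, 2] = np.sin(2 * np.pi * 5 * t / T)

mag = pixelwise_temporal_spectrum(clip)
peak_bin = int(mag[:, 1, 2].argmax())
```

In this toy clip the flickering pixel's spectrum peaks at bin 5, while static pixels have an all-zero spectrum. That contrast shows how per-pixel temporal frequency features can isolate localized temporal inconsistencies that a per-frame spatial spectrum would not expose.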

Datasets

FaceForensics++ (FF++), Celeb-DF-v2 (CDF), DFDC-V2 (DFDC), FaceShifter (FSh), DeeperForensics-v1 (DFo), DeepFake Detection (DFD), Korean DeepFake Detection Dataset (KoDF)

Model(s)

2D ResNet, 3D ResNet-50, transformer encoder

Author countries

Korea
