When Deepfake Detection Meets Graph Neural Network: a Unified and Lightweight Learning Framework

Authors: Haoyu Liu, Chaoyu Gong, Mengke He, Jiate Li, Kai Han, Siqiang Luo

Published: 2025-08-07 15:55:13+00:00

AI Summary

This paper introduces SSTGNN, a lightweight deepfake detection framework using Spatial-Spectral-Temporal Graph Neural Networks. SSTGNN represents videos as structured graphs, enabling joint reasoning across spatial, temporal, and spectral domains, achieving superior performance with significantly fewer parameters than state-of-the-art models.

Abstract

The proliferation of generative video models has made detecting AI-generated and manipulated videos an urgent challenge. Existing detection approaches often fail to generalize across diverse manipulation types due to their reliance on isolated spatial, temporal, or spectral information, and typically require large models to perform well. This paper introduces SSTGNN, a lightweight Spatial-Spectral-Temporal Graph Neural Network framework that represents videos as structured graphs, enabling joint reasoning over spatial inconsistencies, temporal artifacts, and spectral distortions. SSTGNN incorporates learnable spectral filters and temporal differential modeling into a graph-based architecture, capturing subtle manipulation traces more effectively. Extensive experiments on diverse benchmark datasets demonstrate that SSTGNN not only achieves superior performance in both in-domain and cross-domain settings, but also offers strong robustness against unseen manipulations. Remarkably, SSTGNN accomplishes these results with up to 42.4$\times$ fewer parameters than state-of-the-art models, making it highly lightweight and scalable for real-world deployment.
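
The abstract's two distinguishing ingredients, learnable spectral filters and temporal differential modeling, are both compact ideas. Below is a minimal PyTorch sketch of what each could look like; it is not the authors' code, and the 1-D rFFT over the feature dimension, the identity-initialized gains, and the tensor shapes are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class SpectralFilter(nn.Module):
    """Learnable reweighting of frequency bins; gains start at identity,
    so the filter initially passes features through unchanged."""

    def __init__(self, dim: int):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(dim // 2 + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim). Move to the frequency domain, scale each bin,
        # and transform back, letting training amplify or suppress the
        # bands where manipulation artifacts concentrate.
        spec = torch.fft.rfft(x, norm="ortho")
        return torch.fft.irfft(spec * self.gain, n=x.size(-1), norm="ortho")


# Temporal differencing: subtract co-located features in consecutive frames.
# feats: (T, P, D) = frames x patches per frame x feature dims (toy values).
feats = torch.randn(8, 16, 64)
filtered = SpectralFilter(64)(feats)
temporal_diff = filtered[1:] - filtered[:-1]   # (T-1, P, D) motion residuals
```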


Key findings
SSTGNN achieves state-of-the-art performance in both in-domain and cross-domain deepfake detection. It significantly outperforms existing methods while using up to 42.4 times fewer parameters. Ablation studies confirm the contribution of each component to the model's performance.
Approach
SSTGNN represents each video as a spatiotemporal graph whose nodes are image patches; spatial edges encode patch similarity within a frame, while temporal edges encode feature differences between consecutive frames. Learnable spectral filters expose frequency-domain anomalies, and negative edges flag inconsistent patch pairs. A graph attention network then processes the graph representation to classify the video as real or fake, as sketched below.
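
To make the pipeline concrete, here is a small self-contained PyTorch sketch of this style of model. It is not the authors' implementation: the top-k cosine neighborhood, the signed single-head attention, the mean-pool readout, and all shapes are illustrative assumptions standing in for SSTGNN's actual graph construction and GAT layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_adjacency(feats: torch.Tensor, t: int, p: int, k: int = 4) -> torch.Tensor:
    """Signed adjacency over t*p patch nodes. Within each frame, connect each
    patch to its k nearest neighbours by cosine similarity, keeping the sign
    so that dissimilar pairs become negative edges. Across adjacent frames,
    link co-located patches with the norm of their feature difference."""
    n = t * p
    adj = torch.zeros(n, n)
    unit = F.normalize(feats, dim=-1)
    for f in range(t):
        s, e = f * p, (f + 1) * p
        sim = unit[s:e] @ unit[s:e].t()                 # (p, p) cosine matrix
        nbrs = sim.topk(k + 1, dim=-1).indices[:, 1:]   # drop the self-match
        rows = torch.arange(p).unsqueeze(1).expand_as(nbrs)
        adj[s + rows, s + nbrs] = sim[rows, nbrs]       # signed spatial edges
        if f + 1 < t:                                   # temporal edges
            diff = (feats[e:e + p] - feats[s:e]).norm(dim=-1)
            idx = torch.arange(p)
            adj[s + idx, e + idx] = diff
            adj[e + idx, s + idx] = diff
    return adj


class SignedGraphAttention(nn.Module):
    """Single-head attention restricted to graph edges; negative edges flip
    the sign of the incoming message, pushing inconsistent patches apart."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        scores = self.q(x) @ self.k(x).t() / x.size(-1) ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))
        attn = torch.nan_to_num(torch.softmax(scores, dim=-1))  # isolated rows -> 0
        return x + (attn * adj.sign()) @ self.v(x)


# Toy end-to-end pass: 8 frames x 16 patches x 64-dim backbone features
# (in the paper these would come from the modified ResNet-50).
T, P, D = 8, 16, 64
feats = torch.randn(T * P, D)
adj = build_adjacency(feats, T, P)
h = SignedGraphAttention(D)(feats, adj)
logit = nn.Linear(D, 1)(h.mean(dim=0))        # graph-level real/fake score
```

Keeping the sign of the cosine similarity gives one natural reading of the paper's "negative edges": message passing then contrasts inconsistent patches instead of smoothing over them.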
Datasets
FF++, Wild-DF, CD-v1, CD-v2, DFDC, SEINE, SVD, Pika, OpenSora, ZeroScope, Crafter, Gen2, Lavie, MSR-VTT, Youku-mPLUG
Model(s)
Spatial-Spectral-Temporal Graph Neural Network (SSTGNN) with a modified ResNet-50 backbone and Graph Attention Networks (GATs)
Author countries
Singapore, USA, Hong Kong SAR