Towards Generalizable Deepfake Detection with Spatial-Frequency Collaborative Learning and Hierarchical Cross-Modal Fusion

Authors: Mengyu Qiao, Runze Tian, Yang Wang

Published: 2025-04-24 03:23:35+00:00

AI Summary

This paper presents a novel deepfake detection framework that integrates multi-scale spatial-frequency analysis for improved accuracy and generalizability. It combines block-wise Discrete Cosine Transform (DCT) features with convolutional neural networks, and introduces a hierarchical cross-modal fusion mechanism to model spatial-frequency interactions.

Abstract

The rapid evolution of deep generative models poses a critical challenge to deepfake detection, as detectors trained on forgery-specific artifacts often suffer significant performance degradation when encountering unseen forgeries. While existing methods predominantly rely on spatial domain analysis, frequency domain operations are primarily limited to feature-level augmentation, leaving frequency-native artifacts and spatial-frequency interactions insufficiently exploited. To address this limitation, we propose a novel detection framework that integrates multi-scale spatial-frequency analysis for universal deepfake detection. Our framework comprises three key components: (1) a local spectral feature extraction pipeline that combines block-wise discrete cosine transform with cascaded multi-scale convolutions to capture subtle spectral artifacts; (2) a global spectral feature extraction pipeline utilizing scale-invariant differential accumulation to identify holistic forgery distribution patterns; and (3) a multi-stage cross-modal fusion mechanism that incorporates shallow-layer attention enhancement and deep-layer dynamic modulation to model spatial-frequency interactions. Extensive evaluations on widely adopted benchmarks demonstrate that our method outperforms state-of-the-art deepfake detection methods in both accuracy and generalizability.
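To make the local spectral pipeline in component (1) concrete, the sketch below shows one plausible way to combine a block-wise DCT with cascaded multi-scale convolutions in PyTorch. This is an illustrative reconstruction, not the authors' released code: the 8x8 block size, channel widths, kernel sizes, and module names (BlockDCT, LocalSpectralBranch) are assumptions.

```python
# Illustrative sketch of a local spectral branch: block-wise DCT followed by
# multi-scale convolutions. Block size and channel widths are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = torch.arange(n).float()
    basis = torch.cos(math.pi / n * k.view(-1, 1) * (k.view(1, -1) + 0.5))
    basis[0] *= 1.0 / math.sqrt(2)
    return basis * math.sqrt(2.0 / n)


class BlockDCT(nn.Module):
    """Applies a 2-D DCT to non-overlapping b x b blocks of each channel."""
    def __init__(self, block: int = 8):
        super().__init__()
        self.block = block
        self.register_buffer("D", dct_matrix(block))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = self.block
        N, C, H, W = x.shape  # assumes H and W are divisible by b
        blocks = x.unfold(2, b, b).unfold(3, b, b)   # (N, C, H/b, W/b, b, b)
        # 2-D DCT per block: D @ block @ D^T
        spec = torch.einsum("ij,nchwjk,lk->nchwil", self.D, blocks, self.D)
        # Re-assemble blocks back into the spatial layout
        return spec.permute(0, 1, 2, 4, 3, 5).reshape(N, C, H, W)


class LocalSpectralBranch(nn.Module):
    """Block-wise DCT followed by parallel multi-scale convolutions."""
    def __init__(self, in_ch: int = 3, width: int = 32):
        super().__init__()
        self.dct = BlockDCT(block=8)
        # 3x3 / 5x5 / 7x7 convolutions capture spectral artifacts at
        # several scales; their outputs are concatenated and mixed.
        self.scales = nn.ModuleList(
            nn.Conv2d(in_ch, width, k, padding=k // 2) for k in (3, 5, 7)
        )
        self.mix = nn.Conv2d(3 * width, width, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spec = self.dct(x)
        multi = torch.cat([F.relu(conv(spec)) for conv in self.scales], dim=1)
        return self.mix(multi)


if __name__ == "__main__":
    feats = LocalSpectralBranch()(torch.randn(2, 3, 64, 64))
    print(feats.shape)  # torch.Size([2, 32, 64, 64])
```

A matrix-based DCT is used here so the sketch stays self-contained in PyTorch; an equivalent implementation could rely on an FFT-based DCT routine instead.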


Key findings
The proposed method outperforms state-of-the-art deepfake detectors in both accuracy and generalizability, including under cross-dataset evaluation. Ablation studies confirm the contribution of each framework component, and the method remains robust across different compression levels.
Approach
The proposed framework has three components: (1) local spectral feature extraction via block-wise DCT with cascaded multi-scale convolutions; (2) global spectral feature extraction via scale-invariant differential accumulation; and (3) a multi-stage cross-modal fusion mechanism that applies shallow-layer attention enhancement and deep-layer dynamic modulation to combine spatial and frequency features. A simplified sketch of the fusion stage follows.
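The sketch below illustrates the two fusion stages in simplified form: shallow layers enhance spatial features with frequency-guided cross-attention, while deeper layers use frequency features to predict per-channel scale and shift parameters that modulate the spatial features. The exact formulation in the paper may differ; the class names, dimensions, and FiLM-style gating are illustrative assumptions.

```python
# Simplified sketch of shallow attention enhancement and deep dynamic
# modulation for spatial-frequency fusion. Details are assumptions.
import torch
import torch.nn as nn


class ShallowAttentionEnhancement(nn.Module):
    """Shallow-layer fusion: spatial tokens query frequency tokens with
    multi-head cross-attention, and the result refines the spatial map."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, spatial: torch.Tensor, freq: torch.Tensor) -> torch.Tensor:
        # spatial, freq: (N, C, H, W) -> token sequences (N, H*W, C)
        N, C, H, W = spatial.shape
        s = spatial.flatten(2).transpose(1, 2)
        f = freq.flatten(2).transpose(1, 2)
        enhanced, _ = self.attn(query=s, key=f, value=f)
        out = self.norm(s + enhanced)
        return out.transpose(1, 2).reshape(N, C, H, W)


class DeepDynamicModulation(nn.Module):
    """Deep-layer fusion: frequency features predict per-channel scale and
    shift parameters that modulate the spatial features (FiLM-style)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.to_gamma_beta = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, 2 * dim)
        )

    def forward(self, spatial: torch.Tensor, freq: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(freq).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return spatial * (1.0 + gamma) + beta


if __name__ == "__main__":
    s, f = torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56)
    print(ShallowAttentionEnhancement(64)(s, f).shape)  # (2, 64, 56, 56)
    s2, f2 = torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14)
    print(DeepDynamicModulation(256)(s2, f2).shape)     # (2, 256, 14, 14)
```

In this reading, the shallow stage preserves spatial resolution while injecting frequency cues, whereas the deep stage acts as a lightweight gate over high-level spatial semantics; both would typically be applied at multiple backbone stages.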
Datasets
FaceForensics++ (FF++) (HQ and LQ versions), Celeb-DF (v2), DFDC
Model(s)
EfficientNet-B4, custom convolutional neural networks within the spatial and frequency pipelines, and a multi-head attention mechanism within the cross-modal fusion module
Author countries
China, USA