HFMF: Hierarchical Fusion Meets Multi-Stream Models for Deepfake Detection

Authors: Anant Mehta, Bryant McArthur, Nagarjuna Kolloju, Zhengzhong Tu

Published: 2025-01-10 00:20:29+00:00

AI Summary

The paper introduces HFMF, a two-stage deepfake detection framework that uses hierarchical cross-modal feature fusion and multi-stream feature extraction. This approach combines Vision Transformers, convolutional neural networks, and other specialized models to achieve superior performance across diverse datasets.

Abstract

The rapid progress in deep generative models has led to the creation of incredibly realistic synthetic images that are becoming increasingly difficult to distinguish from real-world data. The widespread use of Variational Models, Diffusion Models, and Generative Adversarial Networks has made it easier to generate convincing fake images and videos, which poses significant challenges for detecting and mitigating the spread of misinformation. As a result, developing effective methods for detecting AI-generated fakes has become a pressing concern. In our research, we propose HFMF, a comprehensive two-stage deepfake detection framework that leverages both hierarchical cross-modal feature fusion and multi-stream feature extraction to enhance detection performance against imagery produced by state-of-the-art generative AI models. The first component of our approach integrates Vision Transformers and convolutional nets through a hierarchical feature fusion mechanism. The second component of our framework combines object-level information and a fine-tuned convolutional net model. We then fuse the outputs from both components via an ensemble deep neural net, enabling robust classification performance. We demonstrate that our architecture achieves superior performance across diverse dataset benchmarks while maintaining calibration and interpretability.


Key findings
HFMF achieved significantly higher accuracy than prior methods on the WildRF dataset across different social media platforms. On the CollabDif dataset, it achieved near-perfect accuracy. Ablation studies confirmed the effectiveness of the hierarchical fusion and multi-stream approach.
Approach
HFMF employs a two-stage process. The first stage applies hierarchical feature fusion to ViT and ResNet features. The second stage extracts multi-stream features: YOLOv8 for object-level cues, Sobel filtering for edges, and Xception for fine-grained texture. The outputs of both stages are then fused by an ensemble network for the final classification.
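The two-stage structure can be sketched as follows. This is an illustrative outline, not the authors' implementation: the backbone extractors are stubbed with random vectors, the feature dimensions for the object and edge streams are assumptions, and the hierarchical fusion and ensemble network are simplified here to concatenation plus a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stubbed feature extractors. In HFMF these would be ViT-Base-16, ResNet50,
# YOLOv8 object features, Sobel edge maps, and XceptionNet; here each stub
# just returns a random vector of a plausible (partly assumed) dimension.
def vit_features(img):      return rng.standard_normal(768)   # ViT-Base-16 token dim
def resnet_features(img):   return rng.standard_normal(2048)  # ResNet50 pooled dim
def object_features(img):   return rng.standard_normal(64)    # assumed size
def edge_features(img):     return rng.standard_normal(64)    # assumed size
def xception_features(img): return rng.standard_normal(2048)  # Xception pooled dim

def stage1(img):
    # Stage 1: hierarchical cross-modal fusion of ViT and ResNet features,
    # sketched here as simple concatenation.
    return np.concatenate([vit_features(img), resnet_features(img)])

def stage2(img):
    # Stage 2: multi-stream features (objects, edges, fine-grained texture).
    return np.concatenate([object_features(img),
                           edge_features(img),
                           xception_features(img)])

def ensemble_score(img, w1, w2, b):
    # Final fusion: a single linear layer + sigmoid standing in for the
    # paper's ensemble deep neural network.
    z = np.dot(w1, stage1(img)) + np.dot(w2, stage2(img)) + b
    return 1.0 / (1.0 + np.exp(-z))  # probability the image is fake
```

With zero weights the score is exactly 0.5, which makes the shapes easy to sanity-check: stage 1 yields a 2816-dim vector (768 + 2048) and stage 2 a 2176-dim vector (64 + 64 + 2048) under the assumed dimensions.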
Datasets
WildRF (images from Facebook, Reddit, and X), CollabDif
Model(s)
Vision Transformer (ViT-Base-16), ResNet50, YOLOv8, Sobel edge detection, XceptionNet, Ensemble Deep Neural Network
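Of the listed components, Sobel edge detection is the only hand-crafted one; the sketch below shows what that edge stream computes. The kernels are the standard 3x3 Sobel operators; the naive valid-mode convolution is for clarity only and is not how the paper implements it.

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d(img, kernel):
    # Naive valid-mode 2D cross-correlation (clarity over speed).
    h, w = img.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def sobel_edges(img):
    # Gradient magnitude: strong along edges, zero in flat regions.
    gx = conv2d(img, SOBEL_X)
    gy = conv2d(img, SOBEL_Y)
    return np.hypot(gx, gy)

# A vertical step edge: zeros on the left, ones on the right.
img = np.zeros((5, 5))
img[:, 3:] = 1.0
edges = sobel_edges(img)
```

On this test image the response is zero in the flat left region and peaks where the window straddles the step, which is exactly the kind of boundary signal the edge stream feeds into the fusion.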
Author countries
USA