Wavelet-Driven Generalizable Framework for Deepfake Face Forgery Detection

Authors: Lalith Bharadwaj Baru, Rohit Boddeda, Shilhora Akshay Patel, Sai Mohan Gajapaka

Published: 2024-09-26 21:16:51+00:00

AI Summary

Wavelet-CLIP is a deepfake detection framework that combines wavelet transforms with features from a CLIP-pretrained ViT-L/14 to analyze both the spatial and frequency characteristics of images. It achieves state-of-the-art cross-dataset generalization and robust detection of unseen deepfakes generated by diffusion models.

Abstract

The evolution of digital image manipulation, particularly with the advancement of deep generative models, significantly challenges existing deepfake detection methods, especially when the origin of the deepfake is obscure. To tackle the increasing complexity of these forgeries, we propose Wavelet-CLIP, a deepfake detection framework that integrates wavelet transforms with features derived from the ViT-L/14 architecture, pre-trained in the CLIP fashion. Wavelet-CLIP utilizes wavelet transforms to deeply analyze both spatial and frequency features from images, thus enhancing the model's capability to detect sophisticated deepfakes. To verify the effectiveness of our approach, we conducted extensive evaluations against existing state-of-the-art methods for cross-dataset generalization and detection of unseen images generated by standard diffusion models. Our method showcases outstanding performance, achieving an average AUC of 0.749 for cross-dataset generalization and 0.893 for robustness against unseen deepfakes, outperforming all compared methods. The code is available at: https://github.com/lalithbharadwajbaru/Wavelet-CLIP


Key findings
Wavelet-CLIP outperforms state-of-the-art methods in cross-dataset generalization, achieving an average AUC of 0.749, and shows superior robustness against unseen deepfakes generated by diffusion models, with an average AUC of 0.893. Integrating wavelet transforms significantly improves performance compared to using CLIP features alone.
Approach
The approach uses a pre-trained Vision Transformer (ViT-L/14) from CLIP to extract image features. These features are then processed using Discrete Wavelet Transforms (DWT) to separate low and high-frequency components, with the low-frequency components further refined by an MLP before reconstruction via Inverse DWT and final classification by another MLP.
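A minimal sketch of this head is given below. It assumes a frozen CLIP ViT-L/14 backbone (768-dimensional image embeddings), a single-level Haar wavelet, and illustrative layer sizes; the class name WaveletHead, the hidden width, and the two-layer MLPs are hypothetical and not taken from the authors' repository.

```python
# Hypothetical sketch of the head described above (not the authors' code): a frozen
# CLIP ViT-L/14 supplies a feature vector, a Haar DWT splits it into low/high bands,
# an MLP refines the low band, the inverse DWT reconstructs, and an MLP classifies.
import torch
import torch.nn as nn


class WaveletHead(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 512, num_classes: int = 2):
        super().__init__()
        # Assumed two-layer MLPs; the paper only states that MLPs are used.
        self.refine = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.GELU(), nn.Linear(hidden, dim // 2)
        )
        self.classifier = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, num_classes)
        )

    @staticmethod
    def haar_dwt(x: torch.Tensor):
        # Single-level 1-D Haar DWT along the feature dimension (x: [B, dim], dim even).
        lo = (x[:, 0::2] + x[:, 1::2]) / 2 ** 0.5  # low-frequency (approximation) band
        hi = (x[:, 0::2] - x[:, 1::2]) / 2 ** 0.5  # high-frequency (detail) band
        return lo, hi

    @staticmethod
    def haar_idwt(lo: torch.Tensor, hi: torch.Tensor) -> torch.Tensor:
        # Inverse single-level Haar DWT: interleave reconstructed even/odd samples.
        x = torch.empty(lo.size(0), lo.size(1) * 2, device=lo.device, dtype=lo.dtype)
        x[:, 0::2] = (lo + hi) / 2 ** 0.5
        x[:, 1::2] = (lo - hi) / 2 ** 0.5
        return x

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        lo, hi = self.haar_dwt(clip_feats)   # split CLIP features into frequency bands
        lo = self.refine(lo)                 # refine only the low-frequency components
        recon = self.haar_idwt(lo, hi)       # reconstruct back to the full feature space
        return self.classifier(recon)        # real-vs-fake logits
```

In this sketch only the low-frequency band is remapped while the detail coefficients pass through unchanged, mirroring the description above; if the CLIP backbone is kept frozen (e.g. feeding model.encode_image(images).float() from the openai/clip package into the head), only the head's parameters would be trained.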
Datasets
Training: FaceForensics++ (c23). Testing: Celeb-DF v1, Celeb-DF v2, FaceShifter, and 50,000 images generated by DDPM, DDIM, and LDM diffusion models.
Model(s)
ViT-L/14 (pre-trained with CLIP weights), Multilayer Perceptrons (MLPs)
Author countries
India, USA