Explicit Correlation Learning for Generalizable Cross-Modal Deepfake Detection

Authors: Cai Yu, Shan Jia, Xiaomeng Fu, Jin Liu, Jiahe Tian, Jiao Dai, Xi Wang, Siwei Lyu, Jizhong Han

Published: 2024-04-30 00:25:44+00:00

AI Summary

This paper proposes a novel deepfake detection method that explicitly learns cross-modal correlations between audio and video content to improve generalizability across diverse deepfake generation techniques. It introduces a correlation distillation task that uses automatic speech recognition (ASR) and visual speech recognition (VSR) models as teachers, together with a new benchmark dataset, CMDFD, which contains deepfakes produced by four cross-modal generation methods.

Abstract

With the rising prevalence of deepfakes, there is a growing interest in developing generalizable detection methods for various types of deepfakes. While effective in their specific modalities, traditional detection methods fall short in addressing the generalizability of detection across diverse cross-modal deepfakes. This paper aims to explicitly learn potential cross-modal correlation to enhance deepfake detection towards various generation scenarios. Our approach introduces a correlation distillation task, which models the inherent cross-modal correlation based on content information. This strategy helps to prevent the model from overfitting merely to audio-visual synchronization. Additionally, we present the Cross-Modal Deepfake Dataset (CMDFD), a comprehensive dataset with four generation methods to evaluate the detection of diverse cross-modal deepfakes. The experimental results on CMDFD and FakeAVCeleb datasets demonstrate the superior generalizability of our method over existing state-of-the-art methods. Our code and data can be found at https://github.com/ljj898/CMDFD-Dataset-and-Deepfake-Detection.


Key findings
Experimental results on CMDFD and FakeAVCeleb datasets demonstrate the superior generalizability of the proposed method compared to state-of-the-art methods. The method achieves high AUC scores across various deepfake types, including those generated with the subject's own voice, showcasing its robustness. Ablation studies confirm the importance of both the distillation and contrastive learning components.
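The summary does not spell out the contrastive learning component referenced in the ablation. The snippet below is only a minimal sketch of one common formulation (a symmetric InfoNCE loss over in-batch audio-visual pairs); the function name, temperature, and pairing scheme are assumptions, not the paper's exact loss.

```python
# Assumed InfoNCE-style contrastive term over audio-visual embeddings;
# illustrative only, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def audio_visual_info_nce(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE: the matched audio/visual clip in a batch is the
    positive, all other pairings serve as negatives."""
    a = F.normalize(audio_emb, dim=-1)   # (B, D)
    v = F.normalize(visual_emb, dim=-1)  # (B, D)
    logits = a @ v.t() / temperature     # (B, B) pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```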
Approach
The approach uses a multi-task learning framework with two branches: a detection branch that predicts whether a clip is real or fake, and a distillation branch that learns cross-modal correlation. In the distillation branch, ASR and VSR teacher models provide soft labels for content-based audio-visual correlation, which prevents the model from overfitting to simple audio-visual synchronization.
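As a concrete illustration, here is a minimal PyTorch-style sketch of the two-branch setup under stated assumptions: the encoder modules, embedding dimension, cosine-similarity correlation measure, MSE distillation loss, and loss weight are placeholders rather than the authors' implementation, and the teacher correlation is assumed to come from frozen ASR/VSR content features computed outside this snippet.

```python
# Sketch of the two-branch multi-task setup: a detection branch plus a
# correlation-distillation branch. Shapes, losses, and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchDetector(nn.Module):
    def __init__(self, audio_encoder, visual_encoder, dim=512):
        super().__init__()
        self.audio_encoder = audio_encoder    # e.g. a 2D ResNet34 over spectrograms
        self.visual_encoder = visual_encoder  # visual frontend + temporal network
        self.classifier = nn.Linear(2 * dim, 2)  # detection branch: real vs. fake

    def forward(self, audio, video):
        a = self.audio_encoder(audio)    # (B, dim) audio embedding
        v = self.visual_encoder(video)   # (B, dim) visual embedding
        logits = self.classifier(torch.cat([a, v], dim=-1))
        # student's estimate of cross-modal correlation for the distillation branch
        corr_student = F.cosine_similarity(a, v, dim=-1)  # (B,)
        return logits, corr_student

def multi_task_loss(logits, corr_student, corr_teacher, labels, w_distill=1.0):
    """corr_teacher: soft correlation labels derived from frozen ASR/VSR content
    features, computed outside this snippet (assumed formulation)."""
    loss_det = F.cross_entropy(logits, labels)              # detection branch
    loss_distill = F.mse_loss(corr_student, corr_teacher)   # distillation branch
    return loss_det + w_distill * loss_distill
```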
Datasets
CMDFD (Cross-Modal Deepfake Dataset), FakeAVCeleb
Model(s)
2D ResNet34 (audio encoder), visual frontend and visual temporal network (visual encoder), cross-attention modules, fully connected layer for detection. ASR and VSR models are used as teacher models in the distillation branch.
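To make the fusion step concrete, below is a hedged sketch of cross-attention between audio and visual feature sequences followed by the fully connected detection layer; the number of heads, the mean pooling, and the single-block structure are assumptions, not the paper's exact modules.

```python
# Assumed sketch of cross-attention fusion between audio and visual sequences,
# followed by a fully connected detection layer; not the paper's exact design.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # audio queries attend to visual keys/values, and vice versa
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fc = nn.Linear(2 * dim, 2)  # fully connected layer for detection

    def forward(self, audio_seq, visual_seq):
        # audio_seq: (B, Ta, dim), visual_seq: (B, Tv, dim)
        a_ctx, _ = self.a2v(audio_seq, visual_seq, visual_seq)
        v_ctx, _ = self.v2a(visual_seq, audio_seq, audio_seq)
        # pool over time and concatenate before classification
        fused = torch.cat([a_ctx.mean(dim=1), v_ctx.mean(dim=1)], dim=-1)
        return self.fc(fused)  # (B, 2) real/fake logits
```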
Author countries
China, USA