DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio Cross-Attention and Facial Self-Attention

Authors: Aaditya Kharel, Manas Paranjape, Aniket Bera

Published: 2023-09-12 18:37:05+00:00

AI Summary

DF-TransFusion introduces a novel multi-modal audio-video framework for deepfake detection that concurrently processes audio and video inputs. It leverages lip-audio cross-attention for synchronization cues and facial self-attention on visual features extracted by a fine-tuned VGG-16 network. The proposed method achieves state-of-the-art performance, outperforming existing multi-modal techniques in F-1 and AUC scores.

Abstract

With the rise in manipulated media, deepfake detection has become an imperative task for preserving the authenticity of digital content. In this paper, we present a novel multi-modal audio-video framework designed to concurrently process audio and video inputs for deepfake detection tasks. Our model capitalizes on lip synchronization with input audio through a cross-attention mechanism while extracting visual cues via a fine-tuned VGG-16 network. Subsequently, a transformer encoder network is employed to perform facial self-attention. We conduct multiple ablation studies highlighting different strengths of our approach. Our multi-modal methodology outperforms state-of-the-art multi-modal deepfake detection techniques in terms of F-1 and per-video AUC scores.


Key findings
The DF-TransFusion model achieved state-of-the-art performance on multi-modal deepfake detection, significantly outperforming existing baseline methods with AUC scores of 0.979 on DFDC and 1.000 on DF-TIMIT. Ablation studies confirmed that both the lip-audio cross-attention and the facial self-attention components are crucial to the model's performance. The approach also showed promising results on in-the-wild deepfake videos.
Approach
The approach extracts facial regions using MTCNN and visual cues via a fine-tuned VGG-16 network, followed by a transformer encoder that performs facial self-attention. In parallel, lip regions are extracted and processed together with the raw audio through an audio transformer encoder, where a cross-attention mechanism analyzes lip-audio synchronization. The output embeddings from both transformer branches are then concatenated and fed into a Multi-Layer Perceptron (MLP) head for final classification.
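The fusion stage described above can be sketched in PyTorch. This is a minimal illustration of the two-branch design, not the paper's implementation: the embedding dimension, head counts, layer counts, and pooling choice are assumptions, and the precomputed face, lip, and audio feature sequences stand in for the MTCNN/VGG-16 and audio front-ends.

```python
# Hedged sketch of the DF-TransFusion-style fusion stage.
# All hyperparameters (dim=256, heads=4, 2 encoder layers, mean pooling)
# are illustrative assumptions, not the paper's reported settings.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Facial branch: transformer encoder applying self-attention over
        # per-frame visual features (assumed to come from fine-tuned VGG-16).
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.face_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Lip-audio branch: cross-attention with lip features as queries and
        # audio features as keys/values, modeling lip-audio synchronization.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # MLP head over the concatenated embeddings from both branches.
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1))

    def forward(self, face_feats, lip_feats, audio_feats):
        # face_feats, lip_feats: (B, T, dim); audio_feats: (B, S, dim)
        face_emb = self.face_encoder(face_feats).mean(dim=1)   # (B, dim)
        sync, _ = self.cross_attn(lip_feats, audio_feats, audio_feats)
        sync_emb = sync.mean(dim=1)                            # (B, dim)
        # Concatenate branch embeddings and classify (real/fake logit).
        return self.mlp(torch.cat([face_emb, sync_emb], dim=-1))

model = FusionHead()
logit = model(torch.randn(2, 16, 256),   # face features, 16 frames
              torch.randn(2, 16, 256),   # lip features, 16 frames
              torch.randn(2, 40, 256))   # audio features, 40 steps
print(logit.shape)  # torch.Size([2, 1])
```

Note that cross-attention allows the lip and audio sequences to have different lengths, since only the query (lip) length determines the output length before pooling.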
Datasets
DFDC, DF-TIMIT, FakeAVCeleb
Model(s)
VGG-16 (fine-tuned), Transformer Encoders (with Multi-head Self-Attention and Cross-Attention), MTCNN, Multi-Layer Perceptron (MLP) head
Author countries
USA