DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio Cross-Attention and Facial Self-Attention
Authors: Aaditya Kharel, Manas Paranjape, Aniket Bera
Published: 2023-09-12 18:37:05+00:00
AI Summary
DF-TransFusion introduces a multi-modal audio-video framework for deepfake detection that processes audio and video inputs concurrently. It leverages lip-audio cross-attention for synchronization cues and facial self-attention over visual features extracted by a fine-tuned VGG-16 network. The proposed method outperforms existing multi-modal deepfake detection techniques in F-1 and per-video AUC scores.
Abstract
With the rise in manipulated media, deepfake detection has become an imperative task for preserving the authenticity of digital content. In this paper, we present a novel multi-modal audio-video framework designed to concurrently process audio and video inputs for deepfake detection tasks. Our model capitalizes on lip synchronization with input audio through a cross-attention mechanism while extracting visual cues via a fine-tuned VGG-16 network. Subsequently, a transformer encoder network is employed to perform facial self-attention. We conduct multiple ablation studies highlighting different strengths of our approach. Our multi-modal methodology outperforms state-of-the-art multi-modal deepfake detection techniques in terms of F-1 and per-video AUC scores.
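The cross-attention mechanism described above can be illustrated as scaled dot-product attention in which lip-region features act as queries and audio features supply the keys and values, so each lip frame attends to the audio frames most relevant to it. The sketch below is a minimal NumPy illustration of that idea, not the authors' implementation; all dimensions and variable names are assumptions.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: (T_q, d) lip-region features
    keys, values: (T_k, d) audio features
    Returns (T_q, d) audio-conditioned lip features.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # (T_q, T_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over audio frames
    return weights @ values

# Hypothetical sizes: 16 lip frames, 40 audio frames, 64-dim embeddings
rng = np.random.default_rng(0)
lip = rng.standard_normal((16, 64))
audio = rng.standard_normal((40, 64))
out = cross_attention(lip, audio, audio)
print(out.shape)  # (16, 64)
```

In the paper's setting, the output of this fusion step would then be combined with the VGG-16 facial features and passed through a transformer encoder for self-attention; that stage is standard multi-head self-attention and is omitted here.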