Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization

Authors: Vinaya Sree Katamneni, Ajita Rattani

Published: 2024-08-02 18:45:01+00:00

AI Summary

This paper introduces MMMS-BA, a novel multi-modal attention framework using recurrent neural networks for audio-visual deepfake detection and localization. It leverages contextual information across multiple sequences and modalities to improve accuracy and precision over existing methods, achieving state-of-the-art performance.

Abstract

In the digital age, the emergence of deepfakes and synthetic media presents a significant threat to societal and political integrity. Deepfakes based on multi-modal manipulation, such as audio-visual, are more realistic and pose a greater threat. Current multi-modal deepfake detectors are often based on the attention-based fusion of heterogeneous data streams from multiple modalities. However, the heterogeneous nature of the data (such as audio and visual signals) creates a distributional modality gap and poses a significant challenge to effective fusion and hence to multi-modal deepfake detection. In this paper, we propose a novel multi-modal attention framework based on recurrent neural networks (RNNs) that leverages contextual information for audio-visual deepfake detection. The proposed approach applies attention to multi-modal multi-sequence representations and learns the contributing features among them for deepfake detection and localization. Thorough experimental validation on audio-visual deepfake datasets, namely the FakeAVCeleb, AV-Deepfake1M, TVIL, and LAV-DF datasets, demonstrates the efficacy of our approach. Cross-comparison with published studies demonstrates the superior performance of our approach, with accuracy and precision improved by 3.47% and 2.05% in deepfake detection and localization, respectively, thus obtaining state-of-the-art performance. To facilitate reproducibility, the code and dataset information are available at https://github.com/vcbsl/audiovisual-deepfake/.


Key findings
MMMS-BA outperforms existing methods on multiple datasets, achieving state-of-the-art performance in both deepfake detection and localization, with accuracy and precision improved by 3.47% and 2.05%, respectively. Ablation studies demonstrate the importance of considering both cross-modal and cross-sequence contextual information.
Approach
MMMS-BA uses bidirectional GRUs to process audio, full face, and lip region sequences. It then applies a bi-modal attention mechanism across modality pairs (audio-visual, audio-lip, visual-lip) and a multi-sequence attention mechanism to leverage contextual information. Classification and regression heads are used for detection and localization, respectively.
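To make the pipeline concrete, here is a minimal PyTorch sketch of the idea described above: one bidirectional GRU per stream (audio, full face, lip region), a pairwise cross-modal attention applied to the modality pairs, and linear classification and regression heads. The layer sizes, the exact attention formulation, and all names (`BiModalAttention`, `MMMSBA`, etc.) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the MMMS-BA idea, assuming dot-product bi-modal attention
# with gating; dimensions and module names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiModalAttention(nn.Module):
    """Cross-attention between two modality sequences (assumed form)."""

    def forward(self, x, y):
        # x, y: (batch, seq_len, dim) with matching feature dims
        scores = torch.matmul(x, y.transpose(1, 2))                    # (B, Tx, Ty)
        attended_x = torch.matmul(F.softmax(scores, dim=-1), y)        # context for x
        attended_y = torch.matmul(F.softmax(scores.transpose(1, 2), dim=-1), x)
        # Gate each stream with its cross-modal context and concatenate.
        return torch.cat([attended_x * x, attended_y * y], dim=-1)


class MMMSBA(nn.Module):
    """Illustrative Multi-Modal Multi-Sequence Bi-Modal Attention network."""

    def __init__(self, audio_dim=128, face_dim=512, lip_dim=256, hidden=100):
        super().__init__()
        # One bidirectional GRU per stream: audio, full face, lip region.
        self.audio_gru = nn.GRU(audio_dim, hidden, batch_first=True, bidirectional=True)
        self.face_gru = nn.GRU(face_dim, hidden, batch_first=True, bidirectional=True)
        self.lip_gru = nn.GRU(lip_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = BiModalAttention()
        d = 2 * hidden              # bidirectional GRU output size per stream
        fused = 3 * 2 * d           # three modality pairs, each yielding 2*d features
        self.cls_head = nn.Linear(fused, 2)   # real / fake per sequence (detection)
        self.reg_head = nn.Linear(fused, 2)   # start / end offsets (localization)

    def forward(self, audio, face, lip):
        a, _ = self.audio_gru(audio)   # (B, T, 2*hidden)
        f, _ = self.face_gru(face)
        l, _ = self.lip_gru(lip)
        # Bi-modal attention over the three modality pairs.
        av = self.attn(a, f)           # audio-visual
        al = self.attn(a, l)           # audio-lip
        vl = self.attn(f, l)           # visual-lip
        fused = torch.cat([av, al, vl], dim=-1)   # (B, T, fused)
        return self.cls_head(fused), self.reg_head(fused)


if __name__ == "__main__":
    model = MMMSBA()
    audio = torch.randn(2, 30, 128)    # (batch, sequences, feature dim)
    face = torch.randn(2, 30, 512)
    lip = torch.randn(2, 30, 256)
    logits, spans = model(audio, face, lip)
    print(logits.shape, spans.shape)   # torch.Size([2, 30, 2]) torch.Size([2, 30, 2])
```

The sketch keeps per-sequence outputs so that the classification head scores each sequence in context (detection) while the regression head predicts temporal boundaries (localization), mirroring the two heads described above.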
Datasets
FakeAVCeleb, AV-Deepfake1M, TVIL, and LAV-DF datasets
Model(s)
Multi-Modal Multi-Sequence Bi-Modal Attention (MMMS-BA) framework using Bidirectional GRUs
Author countries
USA