Cross-Modality and Within-Modality Regularization for Audio-Visual DeepFake Detection

Authors: Heqing Zou, Meng Shen, Yuchen Hu, Chen Chen, Eng Siong Chng, Deepu Rajan

Published: 2024-01-11 08:52:13+00:00

AI Summary

This paper proposes a novel audio-visual deepfake detection method using cross-modality and within-modality regularization to improve the robustness of multimodal representation learning. The approach incorporates an audio-visual transformer module and regularization modules to preserve modality distinctions while aligning paired audio-visual signals.

Abstract

Audio-visual deepfake detection scrutinizes manipulations in public videos using complementary multimodal cues. Current methods, which train on fused multimodal data for multimodal targets, face challenges due to uncertainties and inconsistencies in learned representations caused by independent modality manipulations in deepfake videos. To address this, we propose cross-modality and within-modality regularization to preserve modality distinctions during multimodal representation learning. Our approach includes an audio-visual transformer module for modality correspondence and a cross-modality regularization module to align paired audio-visual signals, preserving modality distinctions. Simultaneously, a within-modality regularization module refines unimodal representations with modality-specific targets to retain modality-specific details. Experimental results on the public audio-visual dataset FakeAVCeleb demonstrate the effectiveness and competitiveness of our approach.


Key findings

The proposed method, MRDF, achieves state-of-the-art performance on the FakeAVCeleb dataset, with an accuracy of 94.05% and an AUC of 92.43%. Ablation studies show that both cross-modality and within-modality regularization improve detection accuracy, particularly for deepfakes in which only one modality is manipulated. The approach mitigates misclassification issues observed in baseline methods.

Approach

The authors address inconsistencies in audio-visual deepfake detection by using cross-modality regularization to align paired audio-visual signals while preserving modality distinctions. Within-modality regularization refines unimodal representations using modality-specific targets. An audio-visual transformer module enhances correspondence between audio and visual features.
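The combination of a cross-modality alignment term with per-modality supervised targets can be sketched numerically. The following is a minimal NumPy illustration, not the authors' implementation: the cosine-based alignment loss, the cross-entropy form of the within-modality regularizer, and the unweighted sum are hypothetical stand-ins for the paper's actual objectives.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale embeddings to unit length so cosine similarity is a dot product.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_modality_alignment_loss(audio_emb, visual_emb):
    # Pull paired audio/visual embeddings together: 1 - cosine similarity,
    # averaged over the batch (hypothetical stand-in for the paper's
    # cross-modality regularization term).
    a = l2_normalize(audio_emb)
    v = l2_normalize(visual_emb)
    cos = np.sum(a * v, axis=-1)
    return float(np.mean(1.0 - cos))

def within_modality_ce_loss(logits, labels):
    # Softmax cross-entropy against modality-specific real/fake targets,
    # applied once per modality (hypothetical within-modality regularizer).
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))

# Toy batch: 4 clips, 8-dim embeddings, binary (real/fake) logits per modality.
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(4, 8))
visual_emb = rng.normal(size=(4, 8))
audio_logits = rng.normal(size=(4, 2))
audio_labels = np.array([0, 1, 0, 1])  # per-modality real/fake targets

total = (cross_modality_alignment_loss(audio_emb, visual_emb)
         + within_modality_ce_loss(audio_logits, audio_labels))
print(round(total, 4))
```

In this sketch the alignment term is minimized when paired audio and visual embeddings point in the same direction, while the per-modality cross-entropy keeps each unimodal branch tied to its own real/fake label, mirroring the paper's goal of aligning paired signals without erasing modality-specific detail.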
Datasets

FakeAVCeleb

Model(s)

ResNet-18 (modified) for audio and visual encoders; audio-visual transformer module with 12 transformer blocks

Author countries

Singapore