Multiscale Adaptive Conflict-Balancing Model For Multimedia Deepfake Detection

Authors: Zihan Xiong, Xiaohua Wu, Lei Chen, Fangqi Lou

Published: 2025-05-19 11:01:49+00:00

AI Summary

This paper proposes MACB-DF, an audio-visual joint learning method for deepfake detection that addresses modality imbalance and gradient conflicts. MACB-DF uses contrastive learning for multi-level cross-modal fusion and an orthogonalization-multimodal Pareto module to resolve gradient conflicts, achieving an average accuracy of 95.5% across multiple datasets and showing superior cross-dataset generalization.

Abstract

Advances in computer vision and deep learning have blurred the line between deepfakes and authentic media, undermining multimedia credibility through audio-visual forgery. Current multimodal detection methods remain limited by unbalanced learning between modalities. To tackle this issue, we propose an Audio-Visual Joint Learning Method (MACB-DF) to better mitigate modality conflicts and neglect by leveraging contrastive learning to assist in multi-level and cross-modal fusion, thereby fully balancing and exploiting information from each modality. Additionally, we designed an orthogonalization-multimodal Pareto module that preserves unimodal information while addressing gradient conflicts in audio-video encoders caused by differing optimization targets of the loss functions. Extensive experiments and ablation studies conducted on mainstream deepfake datasets demonstrate consistent performance gains of our model across key evaluation metrics, achieving an average accuracy of 95.5% across multiple datasets. Notably, our method exhibits superior cross-dataset generalization capabilities, with absolute improvements of 8.0% and 7.7% in ACC scores over the previous best-performing approach when trained on DFDC and tested on DefakeAVMiT and FakeAVCeleb datasets.
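The abstract does not spell out the contrastive objective used for cross-modal fusion. As a rough illustration only, the sketch below shows a standard symmetric InfoNCE loss aligning paired audio and video embeddings; the function name, temperature value, and embedding shapes are assumptions, not details from the paper, and the paper's multi-level fusion is not reproduced here.

```python
import numpy as np

def cross_modal_infonce(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over paired audio/video embeddings (N, D).
    Generic sketch of cross-modal contrastive alignment; NOT the
    paper's exact MACB-DF fusion objective."""
    # L2-normalize so dot products are cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = (a @ v.T) / temperature  # (N, N) similarity matrix

    def ce_diag(lg):
        # cross-entropy where row i's correct class is column i
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    # symmetric: audio-to-video and video-to-audio retrieval directions
    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))
```

Minimizing this loss pulls each audio embedding toward its paired video embedding while pushing it away from the other samples in the batch, which is one common way contrastive learning keeps both modalities informative during fusion.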


Key findings

MACB-DF achieves state-of-the-art performance on multiple deepfake detection datasets, with an average accuracy of 95.5%. It shows superior cross-dataset generalization, significantly outperforming previous approaches. Ablation studies confirm the effectiveness of both the contrastive learning component and the Pareto optimization module.

Approach

MACB-DF uses contrastive learning to balance audio and video information during fusion, and incorporates an orthogonalization-multimodal Pareto module to resolve gradient conflicts between the unimodal and multimodal optimization objectives. Multi-scale feature extraction captures both global and local information.
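The summary does not detail how the orthogonalization-multimodal Pareto module resolves gradient conflicts. As a minimal sketch of the general idea, the function below applies a PCGrad-style projection: when two encoder gradients have negative inner product, the conflicting component is removed from one of them. This is a generic stand-in for gradient de-conflicting, not the paper's actual module, and the variable names are assumptions.

```python
import numpy as np

def project_conflicting(g_a, g_v):
    """If the audio-encoder gradient g_a conflicts with the video-encoder
    gradient g_v (negative inner product), project g_a onto the plane
    orthogonal to g_v, removing the opposing component.
    PCGrad-style sketch; NOT the paper's exact Pareto module."""
    dot = float(np.dot(g_a, g_v))
    if dot < 0.0:
        g_a = g_a - (dot / float(np.dot(g_v, g_v))) * g_v
    return g_a

# Hypothetical conflicting gradients from the two encoders
g_audio = np.array([1.0, -2.0, 0.5])
g_video = np.array([-1.0, 1.0, 0.0])
g_audio_adj = project_conflicting(g_audio, g_video)
# After projection the adjusted audio gradient no longer opposes g_video
```

After the projection, stepping along the adjusted audio gradient no longer increases the video encoder's loss along its own gradient direction, which is the intuition behind resolving the gradient conflicts the Approach section describes.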
Datasets

DefakeAVMiT, FakeAVCeleb, DFDC, LAV-DF

Model(s)

UNKNOWN (the paper describes a novel architecture, MACB-DF, but does not specify the pre-trained models used as building blocks)

Author countries

China