A Multimodal Deviation Perceiving Framework for Weakly-Supervised Temporal Forgery Localization

Authors: Wenbo Xu, Junyan Wu, Wei Lu, Xiangyang Luo, Qian Wang

Published: 2025-07-22 13:55:16+00:00

AI Summary

This research introduces a multimodal deviation perceiving framework (MDP) for weakly-supervised temporal forgery localization in deepfakes. MDP utilizes a novel multimodal interaction mechanism and an extensible deviation perceiving loss to identify forged segments using only video-level annotations, achieving results comparable to fully-supervised approaches.

Abstract

Current research on Deepfake forensics often treats detection as a classification task or a temporal forgery localization problem, formulations that are usually restrictive, time-consuming, and challenging to scale to large datasets. To resolve these issues, we present a multimodal deviation perceiving framework for weakly-supervised temporal forgery localization (MDP), which aims to identify temporally partial forged segments using only video-level annotations. MDP proposes a novel multimodal interaction mechanism (MI) and an extensible deviation perceiving loss to perceive multimodal deviation, achieving refined localization of the start and end timestamps of forged segments. Specifically, MI introduces a temporal-property-preserving cross-modal attention to measure the relevance between the visual and audio modalities in a probabilistic embedding space. This identifies inter-modality deviation and constructs comprehensive video features for temporal forgery localization. To further exploit temporal deviation for weakly-supervised learning, an extensible deviation perceiving loss is proposed, aiming to enlarge the deviation between adjacent segments of forged samples and reduce that of genuine samples. Extensive experiments demonstrate the effectiveness of the proposed framework, which achieves results comparable to fully-supervised approaches on several evaluation metrics.
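The cross-modal attention described above can be illustrated with a minimal sketch. This is an assumption-laden simplification: it uses standard scaled dot-product attention with visual features as queries and audio features as keys/values, and omits the paper's temporal-property-preserving constraint and probabilistic embedding space. All names and shapes below are illustrative, not the authors' implementation.

```python
import numpy as np

def cross_modal_attention(visual, audio):
    """Hypothetical sketch of cross-modal attention between modalities.

    visual: (T, d) array of per-segment visual features (used as queries).
    audio:  (T, d) array of per-segment audio features (keys and values).
    Returns audio-attended visual features of shape (T, d); the attention
    weights measure visual-audio relevance per segment pair.
    """
    d = visual.shape[-1]
    # Scaled dot-product relevance between every visual and audio segment.
    logits = visual @ audio.T / np.sqrt(d)            # (T, T)
    # Numerically stable softmax over the audio (key) axis.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Aggregate audio features for each visual segment.
    return weights @ audio
```

In the full framework, the attended features would be fused with the original visual stream to form the comprehensive video representation used for localization.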


Key findings
MDP achieves comparable performance to fully-supervised methods in temporal forgery localization, even with only video-level annotations. The proposed multimodal interaction and deviation perceiving loss significantly improve localization accuracy. Results on the challenging AV-Deepfake1M dataset demonstrate the robustness of the approach.
Approach
The approach uses pre-trained models to extract audio and visual features. A multimodal interaction mechanism aligns and integrates these features using cross-modal attention, preserving temporal properties. A deviation perceiving loss is applied to enhance the separation between genuine and forged segments based on inter-segment deviation.
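The deviation perceiving idea can be sketched as a simple per-video loss: enlarge the deviation between adjacent segment scores for forged videos and suppress it for genuine ones, using only the video-level label. The hinge formulation and the `margin` parameter below are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def deviation_perceiving_loss(scores, is_forged, margin=1.0):
    """Illustrative sketch of a deviation perceiving loss (assumed hinge form).

    scores:    (T,) per-segment forgery scores for one video.
    is_forged: video-level label (the only supervision available).
    For forged videos, the loss pushes the largest adjacent-segment
    deviation toward `margin`; for genuine videos, it penalizes any
    deviation between adjacent segments.
    """
    dev = np.abs(np.diff(scores))          # deviation of adjacent segments
    if is_forged:
        # Hinge: at least one boundary should show a large score jump.
        return max(0.0, margin - dev.max())
    # Genuine video: adjacent segments should score consistently.
    return float(dev.mean())

# A forged video with a clear score jump incurs a small loss:
print(deviation_perceiving_loss(np.array([0.1, 0.1, 0.9, 0.9]), True))   # ≈ 0.2
# A genuine video with flat scores incurs zero loss:
print(deviation_perceiving_loss(np.array([0.5, 0.5, 0.5]), False))       # 0.0
```

The score jumps that this loss encourages at forgery boundaries are what allow the framework to recover start and end timestamps despite training without segment-level labels.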
Datasets
LAV-DF and AV-Deepfake1M
Model(s)
TSN and Wav2Vec (for LAV-DF); ResNet50 and Wav2Vec (for AV-Deepfake1M)
Author countries
China