A Multimodal Deviation Perceiving Framework for Weakly-Supervised Temporal Forgery Localization

Authors: Wenbo Xu, Junyan Wu, Wei Lu, Xiangyang Luo, Qian Wang

Published: 2025-07-22 13:55:16+00:00

Comment: 9 pages, 3 figures, conference

AI Summary

This paper introduces a Multimodal Deviation Perceiving framework (MDP) for weakly-supervised temporal forgery localization, which identifies forged segments in videos using only video-level annotations. MDP employs a novel multimodal interaction mechanism (MI) with temporal property preserving cross-modal attention to detect inter-modality deviations and an extensible deviation perceiving loss to enhance temporal deviation perception between adjacent segments. Extensive experiments show MDP achieves comparable performance to fully-supervised methods in localizing forged segments.

Abstract

Current research on Deepfake forensics often treats detection as a classification task or a temporal forgery localization problem, both of which are usually restrictive, time-consuming, and challenging to scale to large datasets. To resolve these issues, we present a multimodal deviation perceiving framework for weakly-supervised temporal forgery localization (MDP), which aims to identify temporal partial forged segments using only video-level annotations. MDP introduces a novel multimodal interaction mechanism (MI) and an extensible deviation perceiving loss to perceive multimodal deviation, enabling refined localization of the start and end timestamps of forged segments. Specifically, MI introduces a temporal property preserving cross-modal attention to measure the relevance between the visual and audio modalities in the probabilistic embedding space. It can identify the inter-modality deviation and construct comprehensive video features for temporal forgery localization. To further explore temporal deviation for weakly-supervised learning, an extensible deviation perceiving loss is proposed, which aims to enlarge the deviation between adjacent segments of forged samples and reduce that of genuine samples. Extensive experiments demonstrate the effectiveness of the proposed framework, which achieves results comparable to fully-supervised approaches on several evaluation metrics.


Key findings
MDP significantly outperforms other weakly-supervised temporal action localization approaches on both LAV-DF and AV-Deepfake1M datasets. It achieves comparable results to fully-supervised approaches in several evaluation metrics, especially at lower IoU thresholds (AP@0.5, AP@0.75 on LAV-DF; AP@0.1, AP@0.2 on AV-Deepfake1M). The cross-modal attention and deviation perceiving loss are both crucial components, with their combined use yielding the best performance.
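The AP@0.5 and AP@0.75 figures above depend on a temporal IoU threshold: a predicted forged segment counts as a hit only if its overlap with a ground-truth segment reaches that threshold. The following is a minimal sketch of that overlap test, not the paper's evaluation code. The function names `temporal_iou` and `is_hit` and the example timestamps are illustrative assumptions.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals, in seconds."""
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    # Overlap is zero when the intervals do not intersect.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred, gt, threshold):
    """True if the prediction matches the ground truth at this IoU threshold."""
    return temporal_iou(pred, gt) >= threshold


# Hypothetical example: a prediction shifted by 0.5 s against a 2 s forged segment.
pred, gt = (2.0, 4.0), (2.5, 4.5)
print(temporal_iou(pred, gt))
print(is_hit(pred, gt, 0.5), is_hit(pred, gt, 0.75))
```

A prediction that passes a loose threshold can still fail a strict one, which is why results are reported at several thresholds.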
Approach
The MDP framework extracts visual and audio features using pre-trained models, then aligns them in temporal and spatial dimensions. A temporal property preserving cross-modal attention mechanism (MI) is used to perceive inter-modality deviations and construct comprehensive video features. An extensible deviation perceiving loss is proposed to enlarge the deviation of adjacent forged segments and reduce that of genuine samples, enabling weakly-supervised temporal forgery localization.
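The idea behind the deviation perceiving loss can be illustrated with a toy objective that uses only a video-level label. This is a sketch under stated assumptions, not the paper's formulation: the cosine distance, the hinge with a `margin` parameter, the use of the largest adjacent deviation for forged videos, and the function names are all choices made here for illustration.

```python
import math


def adjacent_deviation(features):
    """Cosine distance between each pair of adjacent segment feature vectors."""
    devs = []
    for a, b in zip(features, features[1:]):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        devs.append(1.0 - dot / norm if norm > 0 else 0.0)
    return devs


def deviation_loss(features, is_forged, margin=0.5):
    """Toy video-level loss (assumed form, not taken from the paper).

    Forged video: push the largest adjacent deviation above `margin`.
    Genuine video: pull the mean adjacent deviation toward zero.
    """
    devs = adjacent_deviation(features)
    if not devs:
        return 0.0
    if is_forged:
        return max(0.0, margin - max(devs))
    return sum(devs) / len(devs)


# Hypothetical 2-D segment features: one smooth sequence, one with an abrupt change.
smooth = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
abrupt = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(deviation_loss(smooth, is_forged=False), deviation_loss(abrupt, is_forged=True))
print(deviation_loss(smooth, is_forged=True), deviation_loss(abrupt, is_forged=False))
```

In this sketch the loss is zero when a genuine video is temporally smooth or a forged video contains at least one sharp transition, and positive otherwise, so no segment-level annotation is required.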
Datasets
LAV-DF, AV-Deepfake1M
Model(s)
Feature Extractors: TSN (visual), ResNet50 (visual), Wav2Vec (audio). The proposed framework is MDP, incorporating a Multimodal Interaction Mechanism and Deviation Perceiving Loss.
Author countries
China