Weakly Supervised Multimodal Temporal Forgery Localization via Multitask Learning


Authors: Wenbo Xu, Wei Lu, Xiangyang Luo

Published: 2025-08-04 08:22:39+00:00

AI Summary

This paper introduces WMMT, a weakly supervised multimodal temporal forgery localization model using multitask learning. WMMT achieves fine-grained deepfake detection and temporal localization using only video-level annotations by integrating visual and audio modality detection as binary classification tasks within a multimodal framework.

Abstract

The spread of Deepfake videos has caused a trust crisis and impaired social stability. Although numerous approaches have been proposed to address the challenges of Deepfake detection and localization, there is still a lack of systematic research on weakly supervised multimodal fine-grained temporal forgery localization (WS-MTFL). In this paper, we propose WMMT, a novel weakly supervised multimodal temporal forgery localization method based on multitask learning, which addresses WS-MTFL under the multitask learning paradigm. WMMT achieves multimodal fine-grained Deepfake detection and temporal partial forgery localization using only video-level annotations. Specifically, visual and audio modality detection are formulated as two binary classification tasks, and the multitask learning paradigm integrates these tasks into a single multimodal task. Furthermore, WMMT uses a Mixture-of-Experts structure to adaptively select appropriate features and localization heads, achieving excellent flexibility and localization precision in WS-MTFL. A feature enhancement module with a temporal-property-preserving attention mechanism is proposed to identify intra- and inter-modality feature deviations and construct comprehensive video features. To further exploit temporal information for weakly supervised learning, an extensible deviation perceiving loss is proposed, which aims to enlarge the deviation between adjacent segments of forged samples and reduce the deviation in genuine samples. Extensive experiments demonstrate the effectiveness of multitask learning for WS-MTFL, and WMMT achieves results comparable to fully supervised approaches on several evaluation metrics.
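The abstract describes the deviation perceiving loss only in words: enlarge the deviation between adjacent segments for forged samples and reduce it for genuine ones. The sketch below illustrates that idea under stated assumptions and is not the paper's formulation. The function names, the use of mean absolute difference as the deviation measure, and the hinge-style `margin` are all hypothetical choices made for illustration.

```python
def adjacent_deviation(scores):
    """Mean absolute difference between neighbouring segment scores."""
    if len(scores) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(scores, scores[1:])]
    return sum(diffs) / len(diffs)


def deviation_loss(scores, is_forged, margin=0.5):
    """Illustrative loss (assumed form, not taken from the paper).

    Genuine video: penalise any deviation between adjacent segments.
    Forged video: penalise deviation that falls short of `margin`,
    which pushes adjacent-segment deviation upward.
    """
    d = adjacent_deviation(scores)
    if is_forged:
        return max(0.0, margin - d)
    return d


# Per-segment forgery scores in [0, 1] for two hypothetical videos.
genuine_scores = [0.1, 0.1, 0.1, 0.1]
forged_scores = [0.1, 0.9, 0.9, 0.1]

loss_genuine = deviation_loss(genuine_scores, is_forged=False)
loss_forged = deviation_loss(forged_scores, is_forged=True)
```

In this toy setup a flat genuine score sequence incurs no loss, and a forged sequence incurs loss only when its adjacent-segment deviation stays below the assumed margin.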

Key findings

WMMT achieves comparable results to fully supervised approaches in several evaluation metrics, demonstrating the effectiveness of multitask learning for weakly supervised multimodal temporal forgery localization. The model shows good generalization capability across datasets. The ablation study highlights the contribution of each module, particularly multitask learning and feature enhancement.

Approach

WMMT uses a multitask learning paradigm that treats visual and audio modality detection as two binary classification tasks and integrates them into a single multimodal task. A Mixture-of-Experts structure adaptively selects features and localization heads, and a feature enhancement module with a temporal-property-preserving attention mechanism identifies intra- and inter-modality feature deviations to build comprehensive video features.
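To make the Mixture-of-Experts idea concrete, the sketch below shows a generic softmax-gated mixture: a gate assigns a weight to each expert, and the expert outputs are combined by those weights. This is a textbook form of the technique, not WMMT's actual architecture. The function names, the dense (non-sparse) gating, and the toy inputs are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_experts(gate_logits, expert_outputs):
    """Combine expert outputs using softmax gate weights.

    gate_logits: one score per expert (hypothetical gate output).
    expert_outputs: one equal-length feature vector per expert.
    Returns the gate weights and the weighted sum of expert outputs.
    """
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    mixed = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return weights, mixed


# Two hypothetical experts, e.g. one favouring visual cues and one audio cues.
weights, mixed = mixture_of_experts(
    gate_logits=[0.0, 0.0],
    expert_outputs=[[1.0, 0.0], [0.0, 1.0]],
)
```

With equal gate logits the two experts contribute equally. A gate that strongly prefers one expert would drive the mixture toward that expert's output, which is the sense in which selection is adaptive.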

Datasets

LAV-DF and AV-Deepfake1M

Model(s)

TSN (for visual features), Wav2Vec (for audio features), a custom multimodal architecture with Mixture-of-Experts and a temporal property preserving attention mechanism.

Author countries

China
