Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization

Authors: Qilin Yin, Wei Lu, Xiangyang Luo, Xiaochun Cao

Published: 2025-06-10 06:40:43+00:00

AI Summary

This paper introduces UniCaCLF, a universal context-aware contrastive learning framework for temporal forgery localization (TFL). UniCaCLF uses supervised contrastive learning to identify forged segments in audio-visual content by detecting anomalies relative to a global context, enabling precise localization of manipulated clips.

Abstract

Most research efforts in the multimedia forensics domain have focused on detecting forged audio-visual content and have achieved sound results. However, these works treat deepfake detection only as a classification task and ignore the case where partial segments of a video are tampered with. Temporal forgery localization (TFL) of small fake audio-visual clips embedded in real videos remains challenging and is more in line with realistic application scenarios. To resolve this issue, we propose a universal context-aware contrastive learning framework (UniCaCLF) for TFL. Our approach leverages supervised contrastive learning to discover and identify forged instants by means of anomaly detection, allowing for the precise localization of temporal forged segments. To this end, we propose a novel context-aware perception layer that utilizes a heterogeneous activation operation and an adaptive context updater to construct a context-aware contrastive objective, which enhances the discriminability of forged instant features by contrasting them with genuine instant features in terms of their distances to the global context. An efficient context-aware contrastive coding is introduced to further push the limit of instant feature distinguishability between genuine and forged instants in a supervised sample-by-sample manner, suppressing cross-sample influence to improve temporal forgery localization performance. Extensive experimental results over five public datasets demonstrate that our proposed UniCaCLF significantly outperforms state-of-the-art competing algorithms.


Key findings
UniCaCLF significantly outperforms state-of-the-art methods on five public datasets across various forgery scenarios (multimodal, video-only, audio-only). It achieves superior performance in terms of Average Precision and Average Recall, demonstrating its effectiveness and generalizability. The proposed framework offers a favorable trade-off between accuracy and computational efficiency.
Approach
UniCaCLF leverages supervised contrastive learning to identify forged instants as anomalies relative to a global context. It employs a context-aware perception layer, combining a heterogeneous activation operation with an adaptive context updater, to enhance feature discriminability, together with a context-aware contrastive loss that maximizes the separation between genuine and forged instant features within each sample.
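The per-sample contrastive objective can be illustrated with a minimal sketch. The snippet below is an assumption-laden simplification, not the paper's implementation: the global context is approximated by the mean of genuine instant features (a stand-in for the adaptive context updater), and a standard margin-based contrastive loss pulls genuine instants toward the context while pushing forged instants away, computed independently for each sample to suppress cross-sample influence. The function name, the margin value, and the mean-based context are all hypothetical.

```python
import numpy as np

def context_aware_contrastive_loss(feats, labels, margin=1.0):
    """Per-sample contrastive loss on distances to a global context.

    feats:  (T, D) instant features of one sample
    labels: (T,) per-instant labels, 0 = genuine, 1 = forged
    margin: hypothetical distance margin for forged instants

    Simplified sketch: the context is the mean of genuine features,
    approximating the paper's adaptive context updater.
    """
    genuine = feats[labels == 0]
    forged = feats[labels == 1]
    context = genuine.mean(axis=0)  # hypothetical context estimate
    d_gen = np.linalg.norm(genuine - context, axis=1)
    d_forg = np.linalg.norm(forged - context, axis=1)
    # pull genuine instants toward the context ...
    pull = np.mean(d_gen ** 2)
    # ... and push forged instants beyond the margin
    push = np.mean(np.maximum(0.0, margin - d_forg) ** 2) if len(d_forg) else 0.0
    return pull + push
```

Because the loss is computed sample by sample, forged instants are contrasted only against the genuine context of their own video, which is the intuition behind suppressing cross-sample influence.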
Datasets
LAV-DF, AV-Deepfake1M, TVIL, HAD, Psynd
Model(s)
Two-stream TSN (for video), BYOL-A (for audio), ResNet50 (for video in AV-Deepfake1M), wav2vec (for audio in AV-Deepfake1M)
Author countries
China