Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

Authors: Rohit Kundu, Hao Xiong, Vishal Mohanty, Athula Balachandran, Amit K. Roy-Chowdhury

Published: 2024-12-16 19:00:19+00:00

AI Summary

The paper introduces UNITE, a universal synthetic video detector that addresses limitations of existing face-centric methods by analyzing full video frames. UNITE leverages a transformer-based architecture and an attention-diversity loss to improve detection of face and background manipulations, as well as fully AI-generated content.

Abstract

Existing DeepFake detection techniques primarily focus on facial manipulations, such as face-swapping or lip-syncing. However, advancements in text-to-video (T2V) and image-to-video (I2V) generative models now allow fully AI-generated synthetic content and seamless background alterations, challenging face-centric detection methods and demanding more versatile approaches. To address this, we introduce the Universal Network for Identifying Tampered and synthEtic videos (UNITE) model, which, unlike traditional detectors, captures full-frame manipulations. UNITE extends detection capabilities to scenarios without faces, non-human subjects, and complex background modifications. It leverages a transformer-based architecture that processes domain-agnostic features extracted from videos via the SigLIP-So400M foundation model. Given limited datasets encompassing both facial/background alterations and T2V/I2V content, we integrate task-irrelevant data alongside standard DeepFake datasets in training. We further mitigate the model's tendency to over-focus on faces by incorporating an attention-diversity (AD) loss, which promotes diverse spatial attention across video frames. Combining AD loss with cross-entropy improves detection performance across varied contexts. Comparative evaluations demonstrate that UNITE outperforms state-of-the-art detectors on datasets (in cross-data settings) featuring face/background manipulations and fully synthetic T2V/I2V videos, showcasing its adaptability and generalizable detection capabilities.
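
The abstract states only that the AD loss is combined with cross-entropy; a minimal way to write the resulting training objective, with the weighting factor \lambda being an assumed hyperparameter not named in this summary, is:

\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{AD}}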


Key findings
UNITE outperforms state-of-the-art detectors on various datasets, including those with face/background manipulations and fully synthetic videos, demonstrating adaptability and generalizable detection capabilities. The attention-diversity loss significantly improves performance, particularly on videos without facial manipulations. UNITE achieves high accuracy even on in-the-wild DeepFakes.
Approach
UNITE uses a transformer-based architecture that processes domain-agnostic features extracted from videos via the SigLIP-So400M foundation model. It incorporates an attention-diversity loss to mitigate the model's tendency to over-focus on faces, improving detection across various manipulation types. Task-irrelevant data is integrated with standard DeepFake datasets during training.
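
A minimal sketch of an attention-diversity penalty of this kind, assuming it is computed as the average pairwise cosine similarity between the spatial attention maps of different attention heads (the paper's exact formulation is not given in this summary, and the function name, shapes, and lambda_ad weight below are illustrative):

```python
import torch
import torch.nn.functional as F

def attention_diversity_loss(attn_maps: torch.Tensor) -> torch.Tensor:
    """Illustrative attention-diversity (AD) penalty.

    attn_maps: (batch, num_heads, num_tokens) spatial attention per head.
    Penalizing pairwise cosine similarity between heads pushes them to
    attend to different spatial regions (e.g., background, not only faces).
    The exact AD formulation in UNITE may differ; this is a sketch.
    """
    a = F.normalize(attn_maps, dim=-1)            # unit-norm attention vectors per head
    sim = torch.matmul(a, a.transpose(1, 2))      # (batch, heads, heads) cosine similarities
    num_heads = sim.size(1)
    eye = torch.eye(num_heads, device=sim.device)
    off_diag = sim * (1.0 - eye)                  # zero out each head's self-similarity
    # average similarity over off-diagonal head pairs; lower = more diverse attention
    return off_diag.sum(dim=(1, 2)).mean() / (num_heads * (num_heads - 1))

# Combined objective as described in the abstract (lambda_ad is an assumed hyperparameter):
# loss = F.cross_entropy(logits, labels) + lambda_ad * attention_diversity_loss(attn_maps)
```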
Datasets
FaceForensics++ (FF++), CelebDF, DeeperForensics, Deepfake-TIMIT, HifiFace, UADFV, AVID, SAIL-VOS-3D, DeMamba, New York Times DeepFake quiz
Model(s)
Transformer-based architecture using the SigLIP-So400M foundation model for feature extraction.
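
For concreteness, one standard way to obtain per-frame SigLIP-So400M features is through the Hugging Face transformers vision encoder; the checkpoint name, frame handling, and output usage below are assumptions for illustration, not the paper's verified pipeline:

```python
import torch
from PIL import Image
from transformers import SiglipImageProcessor, SiglipVisionModel

# Assumed public SigLIP-So400M checkpoint; UNITE's exact variant/resolution may differ.
CKPT = "google/siglip-so400m-patch14-384"
processor = SiglipImageProcessor.from_pretrained(CKPT)
encoder = SiglipVisionModel.from_pretrained(CKPT).eval()

def extract_frame_features(frames: list[Image.Image]) -> torch.Tensor:
    """Encode sampled video frames into domain-agnostic patch-token features.

    Returns a tensor of shape (num_frames, num_patches, hidden_dim) that a
    downstream transformer classifier could consume (sketch, not UNITE itself).
    """
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state
```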
Author countries
USA