SynthForensics: A Multi-Generator Benchmark for Detecting Synthetic Video Deepfakes

Authors: Roberto Leotta, Salvatore Alfio Sambataro, Claudio Vittorio Ragaglia, Mirko Casu, Yuri Petralia, Francesco Guarnera, Luca Guarnera, Sebastiano Battiato

Published: 2026-02-04 16:47:37+00:00

AI Summary

This paper introduces SynthForensics, which the authors describe as the first human-centric benchmark for detecting purely synthetic video deepfakes generated by modern text-to-video (T2V) models. The dataset comprises 6,815 unique videos from five state-of-the-art open-source T2V models, was built with a two-stage human-in-the-loop validation, and is provided in four versions (raw, lossless, light compression, and heavy compression) for robustness testing. Experiments show that existing deepfake detectors are fragile and generalize poorly to this new domain: some perform worse than random chance, and top models lose over 30 points under heavy compression.

Abstract

The landscape of synthetic media has been irrevocably altered by text-to-video (T2V) models, whose outputs are rapidly approaching indistinguishability from reality. Critically, this technology is no longer confined to large-scale labs; the proliferation of efficient, open-source generators is democratizing the ability to create high-fidelity synthetic content on consumer-grade hardware. This makes existing face-centric and manipulation-based benchmarks obsolete. To address this urgent threat, we introduce SynthForensics, to the best of our knowledge the first human-centric benchmark for detecting purely synthetic video deepfakes. The benchmark comprises 6,815 unique videos from five architecturally distinct, state-of-the-art open-source T2V models. Its construction was underpinned by a meticulous two-stage, human-in-the-loop validation to ensure high semantic and visual quality. Each video is provided in four versions (raw, lossless, light, and heavy compression) to enable real-world robustness testing. Experiments demonstrate that state-of-the-art detectors are fragile and exhibit limited generalization when evaluated on this new domain: we observe a mean performance drop of $29.19\%$ AUC, with some methods performing worse than random chance, and top models losing over 30 points under heavy compression. The paper further investigates the efficacy of training on SynthForensics as a means to mitigate these observed performance gaps, achieving robust generalization to unseen generators ($93.81\%$ AUC), though at the cost of reduced backward compatibility with traditional manipulation-based deepfakes. The complete dataset and all generation metadata, including the specific prompts and inference parameters for every video, will be made publicly available at [link anonymized for review].

Key findings

State-of-the-art deepfake detectors show a mean performance drop of 29.19% AUC on purely synthetic videos; some perform worse than random chance, and top models lose over 30 points under heavy compression. Training on SynthForensics can mitigate these gaps, achieving robust generalization to unseen generators (93.81% AUC), but at the cost of reduced backward compatibility with traditional manipulation-based deepfakes, revealing a fundamental incompatibility between these forensic domains.
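
For readers unfamiliar with the metric, the following is a minimal sketch of how an AUC score and a mean AUC drop across detectors could be computed. It is not the authors' evaluation code. The function names, detector names, and all numbers are hypothetical, and the sketch assumes the reported drop is measured in AUC percentage points, which the summary does not state explicitly.

```python
from statistics import mean


def roc_auc(labels, scores):
    """AUC as the probability that a random fake (label 1) receives a
    higher score than a random real (label 0); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one real and one fake sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auc_drop(in_domain, cross_domain):
    """Mean drop in AUC percentage points over the detectors in in_domain.

    Both arguments map a detector name to an AUC in [0, 1].
    Assumption: the drop is an absolute difference, not a relative one.
    """
    return mean(100 * (in_domain[d] - cross_domain[d]) for d in in_domain)


# Toy data only (hypothetical): 0 = real video, 1 = synthetic video.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)

# A detector whose scores are inverted on a new domain falls below 0.5,
# which is what "worse than random chance" means for AUC.
inverted_auc = roc_auc(labels, [1 - s for s in scores])

# Hypothetical per-detector AUCs, not values from the paper.
drop = mean_auc_drop(
    {"detector_a": 0.95, "detector_b": 0.90},
    {"detector_a": 0.70, "detector_b": 0.55},
)
```

An AUC of 0.5 corresponds to random guessing, so any value below it means the detector ranks real videos as more likely fake than the synthetic ones.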

Approach

The authors introduce SynthForensics, a new human-centric benchmark for detecting purely synthetic video deepfakes. The benchmark is constructed with a paired-source protocol: 6,815 unique videos are generated with five diverse open-source T2V models, derived from real source videos, and then pass through extensive human-in-the-loop validation for quality and ethical compliance. The authors then use the benchmark to evaluate state-of-the-art deepfake detection methods.

Datasets

SynthForensics, FaceForensics++ (FF++), Deep Fake Detection (DFD), Celeb-DF (CDF)

Model(s)

CFM, RECCE, ProDet, UCF, Effort, AltFreezing, FTCN, GenConViT, DFD-FCG

Author countries

Italy
