Deepfake Detection that Generalizes Across Benchmarks

Authors: Andrii Yermakov, Jan Cech, Jiri Matas, Mario Fritz

Published: 2025-08-08 12:03:56+00:00

AI Summary

This paper presents LNCLIP-DF, a parameter-efficient deepfake detection method that achieves state-of-the-art generalization across multiple benchmarks. It fine-tunes only the Layer Normalization parameters of a pre-trained CLIP vision encoder and uses L2 normalization and latent space augmentations to enhance generalization.

Abstract

The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment. Although many approaches adapt foundation models by introducing significant architectural complexity, this work demonstrates that robust generalization is achievable through a parameter-efficient adaptation of a pre-trained CLIP vision encoder. The proposed method, LNCLIP-DF, fine-tunes only the Layer Normalization parameters (0.03% of the total) and enhances generalization by enforcing a hyperspherical feature manifold using L2 normalization and latent space augmentations. We conducted an extensive evaluation on 13 benchmark datasets spanning from 2019 to 2025. The proposed method achieves state-of-the-art performance, outperforming more complex, recent approaches in average cross-dataset AUROC. Our analysis yields two primary findings for the field: 1) training on paired real-fake data from the same source video is essential for mitigating shortcut learning and improving generalization, and 2) detection difficulty on academic datasets has not strictly increased over time, with models trained on older, diverse datasets showing strong generalization capabilities. This work delivers a computationally efficient and reproducible method, proving that state-of-the-art generalization is attainable by making targeted, minimal changes to a pre-trained CLIP model. The code will be made publicly available upon acceptance.


Key findings
LNCLIP-DF outperforms more complex deepfake detection methods in average cross-dataset AUROC. Training on paired real-fake data from the same source video is crucial for mitigating shortcut learning and improving generalization. Detection difficulty on academic datasets has not strictly increased over time; models trained on older, diverse datasets generalize well.
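The paired-data finding is concrete enough to sketch. Below is a minimal, hypothetical PyTorch Dataset that yields a real and a fake frame drawn from the same source video, so that within each pair the only systematic difference is the manipulation itself. The pair list and frame loader are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of paired real-fake sampling; not the authors' code.
from torch.utils.data import Dataset

class PairedDeepfakeDataset(Dataset):
    """Yields one real and one fake frame from the same source video.

    pairs      : list of (real_video_path, fake_video_path) tuples, where the
                 fake was generated from the real source video.
    load_frame : callable returning a preprocessed frame tensor for a video.
    """

    def __init__(self, pairs, load_frame):
        self.pairs = pairs
        self.load_frame = load_frame

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        real_path, fake_path = self.pairs[idx]
        # Labels: 0 = real, 1 = fake. Keeping the pair together in one batch
        # item prevents the model from latching onto source-specific cues
        # (identity, compression, background) as shortcuts.
        return self.load_frame(real_path), 0, self.load_frame(fake_path), 1
```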
Approach
LNCLIP-DF adapts a pre-trained CLIP vision encoder by fine-tuning only its Layer Normalization parameters (0.03% of the total). It enhances generalization by L2-normalizing features onto a hyperspherical manifold and by latent-space augmentation based on spherical linear interpolation (slerp). Video-level classification is obtained by averaging frame-level predictions.
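As a concrete illustration of this pipeline, here is a minimal sketch: features are L2-normalized, augmented by slerping each one toward another sample's feature, and per-frame scores are averaged into a video-level prediction. The function names, the same-label mixing restriction, and the interpolation range are assumptions, not the paper's exact recipe.

```python
# Minimal sketch of L2 normalization, slerp latent augmentation, and
# frame-averaged video scoring; details are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def slerp(a: torch.Tensor, b: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Spherical linear interpolation between L2-normalized features a and b."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    cos = (a * b).sum(dim=-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7)
    omega = torch.acos(cos)          # angle between the two unit vectors
    so = torch.sin(omega)
    return (torch.sin((1.0 - t) * omega) * a + torch.sin(t * omega) * b) / so

def augment(feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Move each feature along the hypersphere toward a same-label partner."""
    perm = torch.randperm(feats.size(0))
    same = (labels == labels[perm]).float().unsqueeze(1)
    t = torch.rand(feats.size(0), 1) * 0.5 * same  # t = 0 leaves a feature unchanged
    return slerp(feats, feats[perm], t)

def video_score(frame_logits: torch.Tensor) -> torch.Tensor:
    """Video-level prediction: mean of per-frame fake probabilities."""
    return torch.sigmoid(frame_logits).mean()
```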
Datasets
FaceForensics++ (FF++), Celeb-DF-v2 (CDFv2), DeepFake Detection Challenge (DFDC), Google's DFD dataset, Face Forensics in the Wild (FFIW), DeepSpeak v1.1 (DSv1) and DeepSpeak v2.0 (DSv2), Korean DeepFake Detection Dataset (KoDF), FakeAVCeleb (FAVC), DeepFakes from Different Models (DFDM), PolyGlotFake (PGF), and IDForge (IDF).
Model(s)
Pre-trained CLIP ViT-L/14 vision encoder with added L2 normalization, slerp augmentation, and a linear classifier. Only Layer Normalization parameters are fine-tuned.
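A minimal sketch of this parameter-efficient setup, assuming the open_clip package: freeze the ViT-L/14 vision encoder, re-enable only the LayerNorm affine parameters, and attach a linear head. The classifier head and optimizer settings are placeholders, not the paper's configuration.

```python
# Sketch of LayerNorm-only fine-tuning on a CLIP ViT-L/14 vision encoder,
# assuming the open_clip package; head and optimizer are placeholders.
import torch
import torch.nn as nn
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)
encoder = model.visual  # the vision tower

# Freeze everything, then unfreeze only LayerNorm weights and biases
# (roughly 0.03% of the parameters, per the paper).
for p in encoder.parameters():
    p.requires_grad = False
for m in encoder.modules():
    if isinstance(m, nn.LayerNorm):
        for p in m.parameters():
            p.requires_grad = True

classifier = nn.Linear(encoder.output_dim, 1)  # single real/fake logit

trainable = [p for p in encoder.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(
    trainable + list(classifier.parameters()), lr=1e-4
)
```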
Author countries
Czech Republic, Germany