Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition

Authors: Maheswar Bora, Tashvik Dhamija, Shukesh Reddy, Baptiste Chopin, Pranav Balaji, Abhijit Das, Antitza Dantcheva

Published: 2025-11-27 13:30:59+00:00

AI Summary

This paper proposes FauxNet, a novel network for generalizable deepfake detection based on pre-trained Visual Speech Recognition (VSR) features extracted from videos. FauxNet consistently outperforms state-of-the-art methods in zero-shot deepfake detection and can also attribute the specific generation technique used. The authors introduce new datasets, Authentica-Vox and Authentica-HDTF, comprising about 38,000 real and fake videos for extensive evaluation.

Abstract

Deepfake generation has witnessed remarkable progress, contributing to highly realistic generated images, videos, and audio. While technically intriguing, such progress has raised serious concerns related to the misuse of manipulated media. To mitigate such misuse, robust and reliable deepfake detection is urgently needed. Towards this, we propose a novel network, FauxNet, which is based on pre-trained Visual Speech Recognition (VSR) features. By extracting temporal VSR features from videos, we identify and segregate real videos from manipulated ones. The holy grail in this context is zero-shot, i.e., generalizable, detection, which we focus on in this work. FauxNet consistently outperforms the state of the art in this setting. In addition, FauxNet is able to attribute, i.e., to distinguish between the generation techniques from which the videos stem. Finally, we propose new datasets, referred to as Authentica-Vox and Authentica-HDTF, comprising about 38,000 real and fake videos in total, the fake videos created with six recent deepfake generation techniques. We provide extensive analysis and results on the Authentica datasets and FaceForensics++, demonstrating the superiority of FauxNet. The Authentica datasets will be made publicly available.


Key findings
FauxNet demonstrates superior performance in zero-shot deepfake detection, significantly outperforming existing state-of-the-art methods when faced with unseen manipulation techniques. The model is also capable of accurately attributing (classifying) different deepfake generation techniques, which is a novel contribution. The underlying VSR encoder produces distinct, separable feature clusters for real videos and the various fake video types, indicating its robustness for deepfake analysis.
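Cluster separability of this kind is commonly inspected by projecting the pooled video embeddings to 2-D and coloring points by label. The sketch below shows one standard way to do this with t-SNE; the use of t-SNE, and the `embeddings`/`labels` inputs, are assumptions for illustration, not the paper's confirmed visualization procedure.

```python
# Hypothetical sketch: inspecting whether VSR video embeddings form
# separable real/fake/technique clusters. t-SNE is an assumed choice;
# `embeddings` (N x D) and `labels` (N,) stand in for pooled FauxNet
# video embeddings and their class labels.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_embedding_clusters(embeddings: np.ndarray, labels: np.ndarray) -> None:
    # Project high-dimensional video embeddings down to 2-D.
    points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
    # One scatter series per class (real, or a specific generation technique).
    for lab in np.unique(labels):
        mask = labels == lab
        plt.scatter(points[mask, 0], points[mask, 1], s=5, label=str(lab))
    plt.legend()
    plt.title("t-SNE of pooled VSR video embeddings")
    plt.show()
```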
Approach
FauxNet operates by extracting temporal VSR features from the cropped lip regions of input videos using a pre-trained VSR encoder (Auto-AVSR). These features are then aggregated via average pooling into a single video embedding. This embedding is subsequently processed by a multi-task learning framework comprising a common MLP and two linear heads: one for binary deepfake detection (real/fake) and another for classifying the deepfake generation technique.
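The following PyTorch sketch mirrors this pipeline under stated assumptions: the VSR encoder is treated as a black-box module returning per-frame features, it is kept frozen, and the feature and hidden dimensions, layer counts, and head names are illustrative guesses rather than the authors' exact configuration.

```python
# Minimal sketch of the FauxNet pipeline described above, assuming a
# frozen pre-trained VSR encoder that maps lip-region frame sequences
# to per-frame features of size `feat_dim`.
import torch
import torch.nn as nn


class FauxNet(nn.Module):
    def __init__(self, vsr_encoder: nn.Module, feat_dim: int = 768,
                 hidden_dim: int = 256, num_techniques: int = 6):
        super().__init__()
        # Pre-trained VSR encoder (e.g., Auto-AVSR), frozen here by assumption.
        self.vsr_encoder = vsr_encoder
        for p in self.vsr_encoder.parameters():
            p.requires_grad = False
        # Shared MLP trunk over the pooled video embedding.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Two task-specific linear heads.
        self.binary_head = nn.Linear(hidden_dim, 1)               # real vs. fake
        self.multi_head = nn.Linear(hidden_dim, num_techniques)   # attribution

    def forward(self, lip_frames: torch.Tensor):
        # lip_frames: (batch, time, C, H, W) crops of the lip region.
        feats = self.vsr_encoder(lip_frames)   # assumed (batch, time, feat_dim)
        video_emb = feats.mean(dim=1)          # average pooling over time
        shared = self.mlp(video_emb)
        return self.binary_head(shared), self.multi_head(shared)
```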
Datasets
Authentica-Vox, Authentica-HDTF, FaceForensics++
Model(s)
FauxNet, which leverages a pre-trained Visual Speech Recognition (VSR) encoder (specifically Auto-AVSR) as its feature extractor, combined with a Multi-Layer Perceptron (MLP) that includes a BinaryHead for deepfake detection and a MultiHead for deepfake generation technique classification. The paper focuses on video deepfake detection, not audio deepfake detection.
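Given the two heads, training plausibly combines a binary loss for detection with a multi-class loss for attribution. The step below is a hypothetical sketch: the equal loss weighting, the optimizer handling, and how real videos are labeled for the attribution head (e.g., as an extra "real" class) are all assumptions, not details taken from the paper.

```python
# Hypothetical multi-task training step for the two FauxNet heads.
# Equal loss weighting is an assumption, not taken from the paper.
import torch
import torch.nn.functional as F


def training_step(model, lip_frames, is_fake, technique_id, optimizer):
    optimizer.zero_grad()
    binary_logit, multi_logits = model(lip_frames)
    # Real/fake detection loss (BinaryHead): is_fake holds 0/1 targets.
    det_loss = F.binary_cross_entropy_with_logits(
        binary_logit.squeeze(-1), is_fake.float())
    # Generation-technique attribution loss (MultiHead); how real videos
    # enter this head (extra class vs. masked out) is an assumption.
    attr_loss = F.cross_entropy(multi_logits, technique_id)
    loss = det_loss + attr_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```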
Author countries
India, France