ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors

Authors: Kaede Shiohara, Toshihiko Yamasaki, Vladislav Golyanik

Published: 2026-01-05 18:59:54+00:00

Comment: 17 pages, 8 figures, 11 tables; project page: https://mapooon.github.io/ExposeAnyonePage/

AI Summary

ExposeAnyone introduces a fully self-supervised framework for face forgery detection based on personalized audio-to-expression diffusion models. The approach pre-trains a diffusion model to generate facial expressions from audio; the model is then personalized to specific subjects using reference sets. Deepfakes are detected by computing identity distances via diffusion reconstruction errors, yielding a robust and generalizable person-of-interest face forgery detection mechanism.

Abstract

Detecting unknown deepfake manipulations remains one of the most challenging problems in face forgery detection. Current state-of-the-art approaches fail to generalize to unseen manipulations, as they primarily rely on supervised training with existing deepfakes or pseudo-fakes, which leads to overfitting to specific forgery patterns. In contrast, self-supervised methods offer greater potential for generalization, but existing work struggles to learn discriminative representations from self-supervision alone. In this paper, we propose ExposeAnyone, a fully self-supervised approach based on a diffusion model that generates expression sequences from audio. The key idea is that, once the model is personalized to specific subjects using reference sets, it can compute the identity distances between suspected videos and personalized subjects via diffusion reconstruction errors, enabling person-of-interest face forgery detection. Extensive experiments demonstrate that 1) our method outperforms the previous state-of-the-art method by 4.22 percentage points in the average AUC on DF-TIMIT, DFDCP, KoDF, and IDForge datasets, 2) our model is also capable of detecting Sora2-generated videos, where the previous approaches perform poorly, and 3) our method is highly robust to corruptions such as blur and compression, highlighting its applicability to real-world face forgery detection.
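
As a rough illustration of the identity-distance idea described in the abstract, the sketch below scores a suspect clip by the noise-prediction error of a personalized diffusion model: a large error suggests the video's facial dynamics do not match the claimed subject. The denoiser interface (`eps_model`), the choice of timesteps, and the aggregation rule are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch: person-of-interest scoring via diffusion reconstruction error.
# Assumes a noise-prediction model eps_model(x_t, t, audio_feats, subject_tokens)
# trained on FLAME expression sequences conditioned on audio features and
# subject adapter tokens (hypothetical interface).
import torch

@torch.no_grad()
def identity_distance(eps_model, expr_seq, audio_feats, subject_tokens,
                      alphas_cumprod, timesteps=(100, 300, 500)):
    """Mean noise-prediction error of a personalized diffusion model.

    expr_seq:       (T, D) FLAME expression parameters of the suspect video
    audio_feats:    audio conditioning features (e.g. Wav2Vec 2.0 embeddings)
    subject_tokens: adapter tokens of the claimed identity
    """
    x0 = expr_seq.unsqueeze(0)                        # (1, T, D)
    errors = []
    for t in timesteps:
        t_batch = torch.full((1,), t, dtype=torch.long)
        noise = torch.randn_like(x0)
        a_bar = alphas_cumprod[t]
        # Forward diffusion to timestep t, then ask the model to undo it.
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
        eps_hat = eps_model(x_t, t_batch, audio_feats, subject_tokens)
        errors.append(torch.mean((eps_hat - noise) ** 2))
    # Large error -> expressions inconsistent with the personalized subject
    # -> likely forgery.
    return torch.stack(errors).mean().item()
```

Since the reported metrics are AUCs, a scalar score of this kind per video is sufficient; no explicit decision threshold is needed for evaluation.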


Key findings
The method achieved an average AUC of 95.22% across DF-TIMIT, DFDCP, KoDF, and IDForge datasets, outperforming the previous state-of-the-art by 4.22 percentage points. It demonstrated superior performance in detecting Sora2-generated videos, where existing methods struggled. ExposeAnyone also proved highly robust to various corruptions like blur and compression, highlighting its applicability in real-world scenarios.
Approach
The method, ExposeAnyone, pre-trains an audio-to-expression diffusion model (EXAM) on a large, unlabeled video dataset to learn general facial dynamics. The pre-trained model is then personalized to a specific subject with a subject-specific adapter trained on reference videos, capturing that subject's unique talking identity. Forgeries are detected by a proposed content-agnostic authentication mechanism that compares diffusion reconstruction errors, quantifying the discrepancy between an input video and the personalized subject's identity.
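
A minimal sketch of the personalization step, assuming the adapter is a small set of learnable tokens optimized with the standard noise-prediction diffusion loss while the pre-trained backbone stays frozen; the names (`eps_model`, the loader format) and hyperparameters are illustrative, not the paper's settings.

```python
# Hedged sketch: personalize a frozen pre-trained denoiser to one subject by
# optimizing only learnable adapter tokens on that subject's reference videos.
import torch
import torch.nn.functional as F

def personalize(eps_model, reference_loader, alphas_cumprod,
                num_tokens=8, dim=512, steps=1000, lr=1e-3, T=1000):
    for p in eps_model.parameters():
        p.requires_grad_(False)                       # backbone stays frozen
    subject_tokens = torch.nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
    opt = torch.optim.AdamW([subject_tokens], lr=lr)

    data_iter = iter(reference_loader)
    for _ in range(steps):
        try:
            expr_seq, audio_feats = next(data_iter)   # (B, T, D), audio feats
        except StopIteration:
            data_iter = iter(reference_loader)
            expr_seq, audio_feats = next(data_iter)
        t = torch.randint(0, T, (expr_seq.shape[0],))
        noise = torch.randn_like(expr_seq)
        a_bar = alphas_cumprod[t].view(-1, 1, 1)
        x_t = a_bar.sqrt() * expr_seq + (1 - a_bar).sqrt() * noise
        eps_hat = eps_model(x_t, t, audio_feats, subject_tokens)
        loss = F.mse_loss(eps_hat, noise)             # standard DDPM objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return subject_tokens
```

Only the adapter tokens are subject-specific, so one shared backbone can serve many persons of interest, each represented by a small token set learned from their reference videos.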
Datasets
VoxCeleb2, AVSpeech, Acappella (for pre-training); Deepfake-TIMIT (DF-TIMIT), Deepfake Detection Challenge Preview (DFDCP), Korean Deepfake Detection (KoDF), Identity-Driven Multimedia Forgery Detection (IDForge), and Sora2 Cameo Forensics Preview (S2CFP) (for evaluation).
Model(s)
ExposeAnyone Model (EXAM), based on a Diffusion Transformer (DiT) with Time- and Feature-wise Linear Modulation (TiLM). It uses Wav2Vec 2.0 for audio encoding and represents facial expressions with FLAME 3D morphable model parameters, extracted using SPECTRE with a refinement process. Personalization is achieved using learnable adapter tokens.
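
To make the architecture description concrete, here is a hedged sketch of a DiT-style block with TiLM, under the assumption that TiLM predicts per-feature scale and shift from the diffusion-timestep embedding (a FiLM-style modulation applied to the time conditioning), with audio features and subject adapter tokens attached as extra attention context. The actual EXAM/TiLM definitions may differ; this only illustrates the general idea.

```python
# Sketch of a DiT-style block with assumed TiLM conditioning (per-feature
# scale/shift derived from the timestep embedding) and attention over
# concatenated conditioning tokens (audio features + subject adapter tokens).
import torch
import torch.nn as nn

class TiLMBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Timestep embedding -> (scale, shift) pairs for both sublayers.
        self.to_mod = nn.Linear(dim, 4 * dim)

    def forward(self, x, t_emb, cond_tokens):
        # x: (B, T, dim) expression tokens; t_emb: (B, dim) timestep embedding;
        # cond_tokens: (B, C, dim) audio features and subject adapter tokens.
        s1, b1, s2, b2 = self.to_mod(t_emb).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1              # TiLM before attention
        kv = torch.cat([h, cond_tokens], dim=1)        # attend to conditioning
        x = x + self.attn(h, kv, kv, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2              # TiLM before MLP
        return x + self.mlp(h)
```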
Author countries
Japan, Germany