FauForensics: Boosting Audio-Visual Deepfake Detection with Facial Action Units

Authors: Jian Wang, Baoyuan Wu, Li Liu, Qingshan Liu

Published: 2025-05-13 07:18:07+00:00

AI Summary

FauForensics is a novel audio-visual deepfake detection framework that leverages biologically invariant facial action units (FAUs) as forgery-resistant representations. It computes frame-wise audio-visual similarities via a fusion module with learnable cross-modal queries, achieving state-of-the-art performance and superior cross-dataset generalizability.

Abstract

The rapid evolution of generative AI has increased the threat of realistic audio-visual deepfakes, demanding robust detection methods. Existing solutions primarily address unimodal (audio or visual) forgeries but struggle with multimodal manipulations due to inadequate handling of heterogeneous modality features and poor generalization across datasets. To this end, we propose a novel framework called FauForensics, which introduces biologically invariant facial action units (FAUs), quantitative descriptors of facial muscle activity linked to emotion physiology. FAUs serve as forgery-resistant representations that reduce domain dependency while capturing subtle dynamics often disrupted in synthetic content. In addition, instead of comparing entire video clips as in prior works, our method computes fine-grained frame-wise audio-visual similarities via a dedicated fusion module augmented with learnable cross-modal queries. The module dynamically aligns temporal-spatial lip-audio relationships while mitigating multimodal feature heterogeneity. Experiments on FakeAVCeleb and LAV-DF show state-of-the-art (SOTA) performance and superior cross-dataset generalizability, with average improvements of up to 4.83% over existing methods.
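
The contrast the abstract draws between clip-level and frame-wise comparison can be illustrated with a small sketch. This is not the paper's fusion module: it assumes each frame already has an audio embedding and a visual embedding of equal dimension, and it uses plain cosine similarity as a stand-in for the learned similarity. The function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def frame_wise_similarity(audio_frames, visual_frames):
    """One similarity score per time step (assumes the streams are already aligned)."""
    if len(audio_frames) != len(visual_frames):
        raise ValueError("audio and visual streams must have the same number of frames")
    return [cosine(a, v) for a, v in zip(audio_frames, visual_frames)]


def clip_level_similarity(audio_frames, visual_frames):
    """A single score for the whole clip, computed on mean-pooled embeddings."""
    def mean_pool(frames):
        return [sum(col) / len(frames) for col in zip(*frames)]
    return cosine(mean_pool(audio_frames), mean_pool(visual_frames))


# Toy embeddings (hypothetical): the third frame is deliberately mismatched.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

per_frame = frame_wise_similarity(audio, visual)  # [1.0, 1.0, 0.0]
clip = clip_level_similarity(audio, visual)       # about 0.8
```

In this toy case the clip-level score stays high even though one frame is fully inconsistent, while the per-frame scores localize the mismatch. That is the intuition behind fine-grained comparison, under the simplifying assumptions stated above.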


Key findings
FauForensics achieves state-of-the-art performance on the FakeAVCeleb and LAV-DF datasets. It shows superior cross-dataset generalizability, with average improvements of up to 4.83% over existing methods. The framework is also robust to various post-processing perturbations.
Approach
FauForensics uses FAUs to create forgery-resistant representations, reducing domain dependency. It employs a frame-wise fusion module with learnable cross-modal queries to dynamically align temporal-spatial lip-audio relationships, mitigating multimodal feature heterogeneity issues.
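The idea of learnable cross-modal queries can be sketched as a small set of query vectors that attend over the audio and visual frame tokens together. The following is a minimal, assumption-laden illustration using single-head scaled dot-product attention with no projection layers; the summary does not specify the actual module design, so the dimensions, the number of queries, and all names here are hypothetical. In a real model the queries would be trained parameters rather than random values.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Each query attends over all tokens and returns a weighted average of them.

    Single head, no key/value/output projections (a simplification).
    """
    dim = len(tokens[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        fused.append([sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(dim)])
    return fused


rng = random.Random(0)
dim, num_queries, num_frames = 4, 2, 3

# Stand-ins for learnable parameters and for encoder outputs (all hypothetical).
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
audio_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_frames)]
visual_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_frames)]

# The queries see both modalities at once, so each fused vector mixes audio and visual frames.
fused = cross_attend(queries, audio_tokens + visual_tokens)
```

Because the same queries read from both modalities, their outputs live in one shared space, which is one plausible way such a module could reduce feature heterogeneity before classification.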
Datasets
FakeAVCeleb, LAV-DF, DISFA
Model(s)
A custom framework with CSN (video encoder), ME-GraphAU (FAU encoder), and Whisper (audio encoder), along with a multimodal transformer and MLP classifiers.
Author countries
China