Towards Generalizable Deepfake Detection via Forgery-aware Audio-Visual Adaptation: A Variational Bayesian Approach

Authors: Fan Nie, Jiangqun Ni, Jian Zhang, Bin Zhang, Weizhe Zhang, Bin Li

Published: 2025-11-24 13:20:03+00:00

Comment: TIFS AQE

AI Summary

This paper introduces Forgery-aware Audio-Visual Adaptation with Variational Bayes (FoVB), a novel framework for multi-modal deepfake detection. FoVB reformulates audio-visual correlation learning using variational Bayesian estimation, approximating correlation as a Gaussian latent variable. The method leverages forgery-aware features and factorizes latent variables to disentangle intra-modal and cross-modal forgery traces, achieving superior generalizability and robustness.
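To make the core idea concrete, the sketch below (in PyTorch) shows one way an audio-visual correlation could be approximated as a Gaussian latent variable via variational Bayes, using the reparameterization trick and a KL regularizer toward a standard normal prior. The module name, feature dimensions, and concatenation-based fusion are illustrative assumptions, not the paper's exact design.

    import torch
    import torch.nn as nn

    class CorrelationVB(nn.Module):
        """Approximate the audio-visual correlation as a Gaussian latent variable (illustrative)."""
        def __init__(self, dim_a: int = 768, dim_v: int = 768, dim_z: int = 256):
            super().__init__()
            # map concatenated audio/visual features to Gaussian parameters (assumed fusion scheme)
            self.to_mu = nn.Linear(dim_a + dim_v, dim_z)
            self.to_logvar = nn.Linear(dim_a + dim_v, dim_z)

        def forward(self, feat_a: torch.Tensor, feat_v: torch.Tensor):
            h = torch.cat([feat_a, feat_v], dim=-1)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            # reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            # KL divergence to a standard normal prior, the usual variational-Bayes regularizer
            kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
            return z, kl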

Abstract

The widespread application of AIGC content has brought not only unprecedented opportunities but also potential security concerns, e.g., audio-visual deepfakes. Therefore, it is of great importance to develop an effective and generalizable method for multi-modal deepfake detection. Typically, audio-visual correlation learning can expose subtle cross-modal inconsistencies, e.g., audio-visual misalignment, which serve as crucial clues in deepfake detection. In this paper, we reformulate correlation learning with variational Bayesian estimation, where the audio-visual correlation is approximated as a Gaussian-distributed latent variable, and thus develop a novel framework for deepfake detection, i.e., Forgery-aware Audio-Visual Adaptation with Variational Bayes (FoVB). Specifically, given the prior knowledge of pre-trained backbones, we adopt two core designs to estimate audio-visual correlations effectively. First, we exploit various difference convolutions and a high-pass filter to discern local and global forgery traces from both modalities. Second, with the extracted forgery-aware features, we estimate the latent Gaussian variable of audio-visual correlation via variational Bayes. Then, we factorize the variable into modality-specific and correlation-specific ones with an orthogonality constraint, allowing them to better learn intra-modal and cross-modal forgery traces with less entanglement. Extensive experiments demonstrate that our FoVB outperforms other state-of-the-art methods on various benchmarks.
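The abstract's first design, discerning local and global forgery traces with difference convolutions and a high-pass filter, can be pictured with the following hedged PyTorch sketch. The specific operator (a central difference convolution), the Laplacian kernel, and the theta value are assumptions for illustration, not the paper's exact operators.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CentralDifferenceConv2d(nn.Module):
        """Vanilla conv output minus a theta-weighted central-difference term (assumed variant)."""
        def __init__(self, in_ch, out_ch, kernel_size=3, theta=0.7):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                                  padding=kernel_size // 2, bias=False)
            self.theta = theta

        def forward(self, x):
            out = self.conv(x)
            # central difference: subtract the response of the summed kernel weights at the center pixel
            kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
            out_center = F.conv2d(x, kernel_sum, padding=0)
            return out - self.theta * out_center

    def high_pass(x: torch.Tensor) -> torch.Tensor:
        """Simple per-channel Laplacian high-pass filter to expose high-frequency traces."""
        k = torch.tensor([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]], device=x.device)
        k = k.view(1, 1, 3, 3).repeat(x.shape[1], 1, 1, 1)
        return F.conv2d(x, k, padding=1, groups=x.shape[1])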


Key findings
The proposed FoVB framework consistently outperforms other state-of-the-art methods in terms of intra-dataset performance, cross-manipulation generalizability, cross-dataset generalizability, and robustness to unseen perturbations. It demonstrates superior generalizability by effectively identifying both intra-modal and cross-modal forgery contributions through factorized latent variables and an orthogonality constraint. The method achieves promising robustness against diverse audio-visual perturbations, making it suitable for real-world scenarios.
Approach
The FoVB framework tackles multi-modal deepfake detection by reformulating audio-visual correlation learning as variational Bayesian estimation, where the correlation is modeled as a Gaussian latent variable. It employs Global-Local Forgery-aware Adaptation (GLFA) with difference convolutions and high-pass filters to extract intra-modal forgery features. Variational Bayesian Forgery Estimation (VBFE) then estimates the latent correlation variable and factorizes it into modality-specific and correlation-specific components under an orthogonality constraint, so that intra-modal and cross-modal forgery traces are learned with less entanglement.
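The factorization with an orthogonality constraint described above might look roughly like the following sketch; the linear projection heads and the squared-cosine penalty are assumed stand-ins for the paper's actual VBFE components.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def orthogonality_loss(z_specific: torch.Tensor, z_corr: torch.Tensor) -> torch.Tensor:
        """Penalize cosine similarity so the two factors carry less entangled information."""
        zs = F.normalize(z_specific, dim=-1)
        zc = F.normalize(z_corr, dim=-1)
        return (zs * zc).sum(dim=-1).pow(2).mean()

    class LatentFactorizer(nn.Module):
        """Split a latent into modality-specific and correlation-specific parts (illustrative)."""
        def __init__(self, dim_z: int = 256):
            super().__init__()
            self.to_specific = nn.Linear(dim_z, dim_z)   # intra-modal forgery traces
            self.to_corr = nn.Linear(dim_z, dim_z)       # cross-modal (correlation) forgery traces

        def forward(self, z: torch.Tensor):
            z_specific, z_corr = self.to_specific(z), self.to_corr(z)
            return z_specific, z_corr, orthogonality_loss(z_specific, z_corr)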
Datasets
FakeAVCeleb, KoDF, DeAVMiT, DFDC, LAV-DF, IDForge
Model(s)
Vision Transformer (ViT), Transformer blocks (for encoders and backbone adaptation)
Author countries
China