BMRL: Bi-Modal Guided Multi-Perspective Representation Learning for Zero-Shot Deepfake Attribution

Authors: Yaning Zhang, Jiahe Zhang, Chunjie Ma, Weili Guan, Tian Gan, Zan Gao

Published: 2025-04-19 01:11:46+00:00

AI Summary

This paper introduces BMRL, a bi-modal guided multi-perspective representation learning framework for zero-shot deepfake attribution. BMRL leverages image, noise, and edge features along with face parsing and text embeddings to improve the traceability of deepfakes generated by unseen models.

Abstract

The challenge of tracing forged faces back to their source generators has gained significant attention due to the rapid advancement of generative models. However, existing deepfake attribution (DFA) works primarily focus on the interaction among various domains in the vision modality, while other modalities such as text and face parsing remain underexplored. Moreover, they tend not to assess the generalization of deepfake attributors to unseen generators in a fine-grained manner. In this paper, we propose a novel bi-modal guided multi-perspective representation learning (BMRL) framework for zero-shot deepfake attribution (ZS-DFA), which facilitates effective traceability to unseen generators. Specifically, we design a multi-perspective visual encoder (MPVE) to explore general deepfake attribution visual characteristics across three views (i.e., image, noise, and edge). We devise a novel parsing encoder to focus on global face attribute embeddings, enabling parsing-guided DFA representation learning via vision-parsing matching. A language encoder is proposed to capture fine-grained language embeddings, facilitating language-guided general visual forgery representation learning through vision-language alignment. Additionally, we present a novel deepfake attribution contrastive center (DFACC) loss that pulls relevant generators closer and pushes irrelevant ones away, and which can be introduced into DFA models to enhance traceability. Experimental results demonstrate that our method outperforms the state-of-the-art on the ZS-DFA task under various evaluation protocols.
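
The vision-language alignment mentioned above is not spelled out in this summary; the sketch below shows one standard way such an objective is typically implemented (a symmetric, CLIP-style InfoNCE over matched vision/text embedding pairs). Function names and the temperature value are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def vision_language_alignment_loss(vis_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matched vision/text pairs (CLIP-style sketch).
    All names and the temperature value are illustrative assumptions."""
    vis = F.normalize(vis_emb, dim=-1)          # (B, D) visual embeddings
    txt = F.normalize(txt_emb, dim=-1)          # (B, D) language embeddings
    logits = vis @ txt.t() / temperature        # (B, B) scaled cosine similarities
    targets = torch.arange(vis.size(0), device=vis.device)
    # Matched pairs sit on the diagonal; score both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Under this reading, each forged face is paired with its fine-grained language embedding, so the visual encoder is pushed toward attribution-relevant features that the text describes.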


Key findings
BMRL significantly outperforms state-of-the-art methods in zero-shot deepfake attribution across various protocols. The inclusion of face parsing and language modalities, along with the DFACC loss, is crucial for improved performance. The model shows robustness to unseen image corruptions.
Approach
BMRL uses a multi-perspective visual encoder (MPVE) to extract features from image, noise, and edge views. It incorporates face parsing and language encoders for bi-modal guidance and employs a novel deepfake attribution contrastive center (DFACC) loss to enhance traceability to unseen generators.
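
The summary does not give the DFACC loss in closed form; the following is a minimal sketch of one plausible reading, assuming a learnable center per known generator, with features pulled toward their own generator's center and pushed a margin away from the others. The class shape, margin, and cosine-distance choice are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFACCLoss(nn.Module):
    """Hypothetical sketch of a deepfake attribution contrastive center loss:
    one learnable center per generator class; samples are attracted to their
    own center and repelled from other centers beyond a margin."""

    def __init__(self, num_generators: int, feat_dim: int, margin: float = 0.5):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_generators, feat_dim))
        self.margin = margin

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        feats = F.normalize(feats, dim=-1)
        centers = F.normalize(self.centers, dim=-1)
        dist = 1.0 - feats @ centers.t()               # (B, C) cosine distances
        pos = dist.gather(1, labels.unsqueeze(1))      # distance to own center
        # Zero out the positive column, then penalize any other center that
        # comes within `margin` of the sample.
        neg_mask = torch.ones_like(dist).scatter_(1, labels.unsqueeze(1), 0.0)
        neg = F.relu(self.margin - dist) * neg_mask
        return pos.mean() + neg.sum(dim=1).mean()
```

In a setup like this, the centers are optimized jointly with the encoders during training; at zero-shot test time it is the resulting feature space, not the centers themselves, that would transfer to unseen generators.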
Datasets
GenFace, DF40, CelebAHQ, FFHQ
Model(s)
Multi-perspective visual encoder (MPVE), Parsing encoder (PE), Language encoder (LE), MLP, Transformer blocks
Author countries
China