FAME: A Lightweight Spatio-Temporal Network for Model Attribution of Face-Swap Deepfakes

Authors: Wasim Ahmad, Yan-Tsung Peng, Yuan-Hao Chang

Published: 2025-06-13 05:47:09+00:00

AI Summary

FAME is a lightweight spatio-temporal network for Deepfake model attribution, the task of determining which generative model produced a given Deepfake. It integrates spatial and temporal attention mechanisms to improve attribution accuracy while remaining computationally efficient, and it outperforms existing methods on three datasets.

Abstract

The widespread emergence of face-swap Deepfake videos poses growing risks to digital security, privacy, and media integrity, necessitating effective forensic tools for identifying the source of such manipulations. Although most prior research has focused primarily on binary Deepfake detection, the task of model attribution -- determining which generative model produced a given Deepfake -- remains underexplored. In this paper, we introduce FAME (Fake Attribution via Multilevel Embeddings), a lightweight and efficient spatio-temporal framework designed to capture subtle generative artifacts specific to different face-swap models. FAME integrates spatial and temporal attention mechanisms to improve attribution accuracy while remaining computationally efficient. We evaluate our model on three challenging and diverse datasets: Deepfake Detection and Manipulation (DFDM), FaceForensics++, and FakeAVCeleb. Results show that FAME consistently outperforms existing methods in both accuracy and runtime, highlighting its potential for deployment in real-world forensic and information security applications.


Key findings
FAME consistently outperforms existing methods in both accuracy and runtime across three datasets. It achieves high accuracy even with lower-resolution input frames (112×112), demonstrating its efficiency and suitability for real-world forensic applications. The model shows strong generalization across various manipulation techniques and compression levels.
Approach
FAME uses a truncated VGG-19 network for spatial feature extraction, followed by an attention-enhanced bidirectional LSTM for temporal encoding. Spatial and temporal attention mechanisms focus on subtle generative artifacts, and a weighted feature aggregation produces a video-level representation for model attribution.
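The summary does not include an implementation, but the pipeline it describes can be sketched in PyTorch. In the minimal sketch below, the class name FAMESketch, the VGG-19 truncation point, the hidden size, and the number of attribution classes are all illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class FAMESketch(nn.Module):
    """Illustrative FAME-style attribution network: truncated VGG-19 spatial
    features -> spatial attention pooling -> attention-enhanced BiLSTM ->
    weighted temporal aggregation -> fully connected classifier.
    Layer sizes and the truncation point are assumptions, not the paper's spec."""

    def __init__(self, num_models=5, hidden=256):
        super().__init__()
        # Truncate VGG-19 after an early 512-channel conv block (assumed cut point).
        self.backbone = vgg19(weights=None).features[:21]
        feat_dim = 512  # channel count at this VGG-19 depth
        # Spatial attention: score each spatial location, softmax-pool to one vector per frame.
        self.spatial_att = nn.Conv2d(feat_dim, 1, kernel_size=1)
        # Temporal encoder over the sequence of per-frame vectors.
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Temporal attention: weight each frame's hidden state before aggregation.
        self.temporal_att = nn.Linear(2 * hidden, 1)
        self.classifier = nn.Linear(2 * hidden, num_models)

    def forward(self, frames):            # frames: (B, T, 3, H, W)
        B, T, C, H, W = frames.shape
        x = self.backbone(frames.view(B * T, C, H, W))              # (B*T, 512, h, w)
        a = torch.softmax(self.spatial_att(x).flatten(2), dim=-1)   # (B*T, 1, h*w)
        x = (x.flatten(2) * a).sum(-1)                              # (B*T, 512) per-frame vector
        h, _ = self.bilstm(x.view(B, T, -1))                        # (B, T, 2*hidden)
        w = torch.softmax(self.temporal_att(h), dim=1)              # (B, T, 1) frame weights
        video = (h * w).sum(1)                                      # (B, 2*hidden) video-level embedding
        return self.classifier(video)                               # (B, num_models) attribution logits
```

Here softmax-normalized attention implements both the spatial pooling and the weighted feature aggregation over frames; the paper's exact attention formulation may differ.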
Datasets
Deepfake Detection and Manipulation (DFDM), FaceForensics++, FakeAVCeleb (visual modality only)
Model(s)
Truncated VGG-19, Bidirectional LSTM with attention mechanisms, fully connected layer for classification
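For a quick shape check of the sketch above, at the 112×112 input resolution reported in the key findings (batch size, frame count, and class count are illustrative):

```python
model = FAMESketch(num_models=5)         # one class per candidate generator; 5 is an assumed example
clip = torch.randn(2, 16, 3, 112, 112)   # 2 clips of 16 frames at 112x112 (frame count assumed)
logits = model(clip)
print(logits.shape)                      # torch.Size([2, 5])
```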
Author countries
Taiwan