Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics

Authors: Tom Or, Omri Azencot

Published: 2025-08-01 17:07:00+00:00

AI Summary

This paper proposes using latent representations from intermediate layers of large pre-trained multi-modal models for deepfake detection. It demonstrates that linear classifiers trained on these features achieve state-of-the-art results across audio and image modalities, while being computationally efficient and effective in few-shot settings.

Abstract

Generative models achieve remarkable results in multiple data domains, including images and text, among others. Unfortunately, malicious users exploit synthetic media to spread misinformation and disseminate deepfakes. Consequently, the need for robust and stable fake detectors is pressing, especially as new generative models appear every day. While most existing work trains classifiers that discriminate between real and fake information, such tools typically generalize only within the same family of generators and data modalities, yielding poor results on other generative classes and data domains. Towards a universal classifier, we propose the use of large pre-trained multi-modal models for the detection of generative content. Effectively, we show that the latent code of these models naturally captures information discriminating real from fake. Building on this observation, we demonstrate that linear classifiers trained on these features can achieve state-of-the-art results across various modalities, while remaining computationally efficient, fast to train, and effective even in few-shot settings. Our work primarily focuses on fake detection in audio and images, achieving performance that surpasses or matches that of strong baseline methods.


Key findings
The proposed method outperforms or matches state-of-the-art deepfake detection methods on both image and audio datasets. Intermediate layers of multi-modal models provide more effective features than initial or final layers for deepfake detection. The approach is computationally efficient and effective even with limited training data.
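
The layer effect can be checked with a simple sweep: cache features from every hidden layer, fit a linear probe on each, and compare cross-validated accuracy. The sketch below is a hypothetical protocol, assuming a dict `feats` mapping layer indices to (n_samples, dim) feature matrices (extracted, e.g., as in the sketch under Approach below) and 0/1 labels `y`; logistic regression stands in here for the paper's SVM/MLP probes.

```python
# Hypothetical per-layer probe sweep: `feats[layer]` is an (n_samples, dim)
# feature matrix cached from one hidden layer, `y` holds real/fake labels.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def sweep_layers(feats, y, cv=5):
    """Score each layer's features with a cross-validated linear probe."""
    return {
        layer: cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv).mean()
        for layer, X in feats.items()
    }

# Per the paper's finding, accuracy should peak at intermediate layers,
# not at the earliest or the final layer.
```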
Approach
The approach leverages large pre-trained multi-modal models such as CLIP-ViT and ImageBind. Instead of using the final layer, it extracts features from intermediate layers, hypothesizing that these better capture the information that discriminates real from fake content. A lightweight classifier (a linear SVM or a small MLP) is then trained on these frozen features.
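
As a rough illustration of this pipeline, the sketch below pulls the [CLS] token from an intermediate hidden layer of a pre-trained CLIP vision encoder (via Hugging Face `transformers`) and fits a linear SVM on the frozen features. The checkpoint name, the layer index, and the `real_images`/`fake_images` placeholders are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: intermediate-layer CLIP-ViT features + a linear classifier.
# `real_images` / `fake_images` are lists of PIL images (placeholders);
# the checkpoint and LAYER index are assumptions to tune on validation data.
import numpy as np
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel
from sklearn.svm import LinearSVC

CHECKPOINT = "openai/clip-vit-large-patch14"  # assumed backbone
LAYER = 12  # an intermediate block; ViT-L/14 has 24 transformer layers

processor = CLIPImageProcessor.from_pretrained(CHECKPOINT)
model = CLIPVisionModel.from_pretrained(CHECKPOINT).eval()

@torch.no_grad()
def layer_features(images, layer=LAYER):
    """Return the [CLS] token of the chosen hidden layer for a batch of images."""
    inputs = processor(images=images, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output; hidden_states[i] is block i's output.
    return out.hidden_states[layer][:, 0].cpu().numpy()

X = np.concatenate([layer_features(real_images), layer_features(fake_images)])
y = np.array([0] * len(real_images) + [1] * len(fake_images))

clf = LinearSVC().fit(X, y)  # fast linear classifier on frozen features
```

Because the backbone stays frozen, training reduces to fitting a linear model on cached features, which is what makes the approach cheap to train and usable in few-shot settings.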
Datasets
ProGAN, LSUN, ASVspoof 2019, In-the-Wild, and test sets from a range of generative models (StyleGAN, StyleGAN2, BigGAN, CycleGAN, StarGAN, GauGAN, CRN, IMLE, SAN, SITD, DeepFakes, Guided, LDM, Glide, DALL-E, WhichFaceIsReal), among others.
Model(s)
CLIP-ViT, ImageBind, SVM, MLP
Author countries
Israel