Toward Generalized Detection of Synthetic Media: Limitations, Challenges, and the Path to Multimodal Solutions

Authors: Redwan Hussain, Mizanur Rahman, Prithwiraj Bhattacharjee

Published: 2025-11-14 09:44:44+00:00

Comment: 10 pages, 4 figures, 1 table; 7th International Conference on Trends in Computational and Cognitive Engineering (TCCE-2025)

AI Summary

This study reviews twenty-four recent works on AI-generated media detection, analyzing their contributions and weaknesses. It identifies common limitations and challenges in current unimodal approaches, particularly poor generalization to unseen data, content from varied generative models, and multimodal content. The paper proposes a research direction centered on multimodal deep learning models to achieve more robust and generalized detection of synthetic media.

Abstract

Artificial intelligence (AI) in media has advanced rapidly over the last decade. The introduction of Generative Adversarial Networks (GANs) improved the quality of photorealistic image generation, and diffusion models later ushered in a new era of generative media. These advances have made it difficult to separate real content from synthetic content. The rise of deepfakes demonstrated how these tools could be misused for misinformation, political conspiracies, privacy violations, and fraud. For this reason, many detection models have been developed. They often use deep learning methods such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to search for visual, spatial, or temporal anomalies. However, such approaches often fail to generalize to unseen data and struggle with content produced by different generative models. In addition, existing approaches are ineffective on multimodal data and highly modified content. This study reviews twenty-four recent works on AI-generated media detection, examining each study individually to identify its contributions and weaknesses. The review then summarizes the common limitations and key challenges faced by current approaches. Based on this analysis, a research direction is suggested with a focus on multimodal deep learning models, which have the potential to provide more robust and generalized detection. The review offers future researchers a clear starting point for building stronger defenses against harmful synthetic media.


Key findings
Current AI-generated media detection models exhibit poor generalization across unseen data and struggle with content from different generative models. Traditional unimodal approaches are insufficient for subtle or highly modified multimodal content, indicating the need for a shift toward multimodal deep learning models. Developing generalized multimodal detection systems is identified as the most promising path for future research to combat evolving synthetic media.
Approach
The authors conduct a comprehensive literature review of twenty-four recent studies on AI-generated media detection. They analyze each paper individually to identify its contributions, methods, results, and limitations, then summarize the overarching challenges and propose future research directions.
Datasets
The paper itself is a review and does not use datasets for its own experiments. However, it discusses numerous datasets utilized in the reviewed literature, including UADFV, FF++, Celeb-DF v2, So-Fake-Set, GenBuster-200k, VID-AID, VidProM, GenVidBench, Physics-IQ, FakeAVCeleb, Panda70M, YouTube-8M, Sports-1M, FloreView, Socrates, GVD, ForgeryNet, DVF, COCO, LSUN, MIT, Video-ACID, GVF, TMC Dataset, CIFAKE, ML Olympiad Competition, ASVspoof 2019 Evaluation, and DFDC.
Model(s)
The paper is a literature review and does not implement new models. It discusses various models and architectures employed in the reviewed works, such as Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), Xception, DenseNet, Pyramid Flow, ResNet50, ResNet34, CLIP:ViT, BERT, LSTM, LLaVA, DINOv2, VideoMAE, SigLIP, Qwen2.5 Instruct, Sophia Thinking, LoRA, and R(2+1)D-18 CNN.
Author countries
Bangladesh