Seeing, Hearing, and Knowing Together: Multimodal Strategies in Deepfake Videos Detection

Authors: Chen Chen, Dion Hoe-Lian Goh

Published: 2026-02-01 15:29:56+00:00

AI Summary

This paper investigates human strategies for detecting deepfake videos, focusing on visual, audio, and knowledge-based cues. A study with 195 participants revealed that people are more accurate with real videos than deepfakes and are less calibrated (overconfident) for deepfake content. The research identifies effective cue combinations, particularly highlighting the importance of multimodal approaches in human deepfake detection.

Abstract

As deepfake videos become increasingly difficult for people to recognise, understanding the strategies humans use is key to designing effective media literacy interventions. We conducted a study with 195 participants between the ages of 21 and 40, who judged real and deepfake videos, rated their confidence, and reported the cues they relied on across visual, audio, and knowledge strategies. Participants were more accurate with real videos than with deepfakes and showed lower expected calibration error for real content. Through association rule mining, we identified cue combinations that shaped performance. Visual appearance, vocal cues, and intuition often co-occurred in successful identifications, which highlights the importance of multimodal approaches in human detection. Our findings show which cues help or hinder detection and suggest directions for designing media literacy tools that guide effective cue use. Building on these insights can help people improve their identification skills and become more resilient to deceptive digital media.


Key findings
Participants were more accurate at identifying real videos than deepfakes and showed poorer confidence calibration (overconfidence) for deepfakes. While multimodal strategies improved accuracy and calibration for real videos, they showed little benefit for deepfake detection, suggesting that deepfakes disrupt these strategies. Visual appearance and intuition were effective for deepfake detection, while audio cues (vocal, language) were more diagnostic for real videos, underscoring an asymmetry in human detection capabilities.
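The calibration result above is reported as expected calibration error (ECE): confidence ratings are binned, and the gap between mean confidence and accuracy is averaged across bins, weighted by bin size. A minimal sketch follows; the bin count and the mapping of confidence ratings to [0, 1] are illustrative assumptions, not the paper's exact protocol.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin weight) * |accuracy - mean confidence|.

    confidences: per-judgment confidence in [0, 1] (assumed scaling).
    correct: per-judgment booleans (judgment matched ground truth).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece


# Example: five judgments at 90% confidence, four of them correct,
# gives an ECE of |0.8 - 0.9| = 0.1 (overconfident).
ece = expected_calibration_error([0.9] * 5, [True, True, True, True, False])
```

A lower ECE means confidence tracks accuracy more closely, which is why the finding of lower ECE for real videos indicates better-calibrated judgments on authentic content.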
Approach
The researchers conducted an empirical study with 195 human participants who judged real and deepfake videos, rated their confidence, and self-reported the visual, audio, and knowledge-based cues they used. Association rule mining was then employed to identify frequent co-occurrences of cues and their impact on detection performance and calibration.
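Association rule mining of this kind treats each participant judgment as a "transaction" of reported cues and searches for rules A → B with sufficient support (co-occurrence frequency) and confidence (conditional frequency). A minimal single-antecedent sketch is below; the cue names, thresholds, and restriction to pairwise rules are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations

def mine_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Find rules {a} -> {b} among reported cue sets.

    transactions: list of sets of cue labels (one set per judgment).
    Returns (antecedent, consequent, support, confidence) tuples.
    """
    n = len(transactions)
    items = sorted({cue for t in transactions for cue in t})

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for a, b in combinations(items, 2):
        for ante, cons in [({a}, {b}), ({b}, {a})]:
            joint = support(ante | cons)
            if joint >= min_support:
                conf = joint / support(ante)
                if conf >= min_confidence:
                    rules.append((ante, cons, round(joint, 2), round(conf, 2)))
    return rules


# Hypothetical cue transactions from four judgments.
judgments = [
    {"visual_appearance", "vocal"},
    {"visual_appearance", "vocal"},
    {"visual_appearance"},
    {"vocal"},
]
rules = mine_rules(judgments)
```

On this toy data, both directions of the visual-appearance/vocal pairing clear the thresholds, mirroring the paper's finding that certain cue combinations frequently co-occur in successful identifications.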
Datasets
A custom stimulus set of 20 videos (10 authentic, 10 deepfake) collected from publicly accessible sources such as YouTube, featuring public figures delivering spoken messages. Figures mentioned include Mark Zuckerberg, Kim Kardashian, Tom Cruise, Manoj Tiwari, Jeremy Corbyn, Obama, Morgan Freeman, Jose Mourinho, Hillary Clinton, Ellen, Biden, Trump, and President Uhuru.
Model(s)
UNKNOWN
Author countries
Singapore