Human Detection of Political Speech Deepfakes across Transcripts, Audio, and Video

Authors: Matthew Groh, Aruna Sankaranarayanan, Nikhil Singh, Dong Young Kim, Andrew Lippman, Rosalind Picard

Published: 2022-02-25 18:47:32+00:00

AI Summary

This paper investigates human ability to detect political deepfakes across various media modalities, audio sources, and misinformation base rates through five randomized experiments with 2,215 participants. It reveals that audio-visual cues significantly improve human discernment compared to text alone, and deepfakes generated with state-of-the-art text-to-speech algorithms are harder to detect than those with voice actor audio. The findings suggest human discernment relies more on how something is said (audio-visual cues) than what is said (speech content).

Abstract

Recent advances in technology for hyper-realistic visual and audio effects provoke the concern that deepfake videos of political speeches will soon be indistinguishable from authentic video recordings. The conventional wisdom in communication theory predicts people will fall for fake news more often when the same version of a story is presented as a video versus text. We conduct 5 pre-registered randomized experiments with 2,215 participants to evaluate how accurately humans distinguish real political speeches from fabrications across base rates of misinformation, audio sources, question framings, and media modalities. We find that base rates of misinformation minimally influence discernment, and that deepfakes with audio produced by state-of-the-art text-to-speech algorithms are harder to discern than the same deepfakes with voice actor audio. Moreover, across all experiments, we find that audio and visual information enables more accurate discernment than text alone: human discernment relies more on how something is said (the audio-visual cues) than on what is said (the speech content).


Key findings

Humans discern political deepfakes significantly more accurately when given audio and visual information than when given text alone. Deepfakes whose audio comes from state-of-the-art text-to-speech algorithms are harder to detect than the same deepfakes with voice actor audio. The base rate of misinformation has minimal influence on discernment accuracy.

Approach

The authors ran five pre-registered randomized experiments with 2,215 participants to assess how accurately people distinguish real political speeches from deepfakes. The experiments systematically varied media modality (text, audio, silent video, video with audio, and combinations), audio source (voice actor vs. text-to-speech), the base rate of fabricated content, and question framing.
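
As a rough illustration of what randomized assignment to media modalities can look like, the sketch below shuffles participants and deals them round-robin into conditions. It is not the authors' procedure: the function name `assign_conditions`, the seed, and the participant count of 100 are assumptions made for illustration, and the combined modalities mentioned above are omitted.

```python
import random
from collections import Counter

# Modality labels taken from the summary above; the combined conditions
# used in the study are left out of this sketch.
MODALITIES = ["text", "audio", "silent video", "video with audio"]


def assign_conditions(participant_ids, conditions, seed=0):
    """Randomly assign each participant to exactly one condition.

    Hypothetical helper, not the authors' procedure. Participants are
    shuffled with a seeded generator and dealt round-robin, so group
    sizes differ by at most one.
    """
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {pid: conditions[i % len(conditions)] for i, pid in enumerate(ids)}


# Illustrative participant count only; not a figure from the paper.
assignment = assign_conditions(range(100), MODALITIES)
group_sizes = Counter(assignment.values())
```

Fixing the seed makes the assignment reproducible, which is convenient when an allocation has to be audited later.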

Datasets

Presidential Deepfake Dataset (PDD); videos from Barari et al. (2021)

Model(s)

UNKNOWN

Author countries

USA