Study of detecting behavioral signatures within DeepFake videos

Authors: Qiaomu Miao, Sinhwa Kang, Stacy Marsella, Steve DiPaola, Chao Wang, Ari Shapiro

Published: 2022-08-06 18:30:53+00:00

Comment: 9 pages

AI Summary

This paper investigates whether human behavioral signatures, specifically movements, can be distinguished from a person's visual appearance in synthetic videos to detect deepfakes. The authors conduct a user study comparing synthetic videos generated by transferring behavior signals from different sources (different person/utterance, same person/different utterance, different person/same utterance) to a target speaker's appearance. Their findings indicate that synthetic videos, in all cases, are perceived as less real and engaging than original videos, suggesting the detectability of a behavioral signature separate from visual appearance.

Abstract

There is strong interest in the generation of synthetic video imagery of people talking for various purposes, including entertainment, communication, training, and advertisement. With the development of deep fake generation models, synthetic video imagery will soon be visually indistinguishable to the naked eye from naturally captured video. In addition, many methods continue to improve, evading more careful forensic visual analysis. Some deep fake videos are produced through the use of facial puppetry, which directly controls the head and face of the synthetic image through the movements of an actor, allowing the actor to 'puppet' the image of another. In this paper, we address the question of whether one person's movements can be distinguished from the original speaker's by controlling the visual appearance of the speaker but transferring the behavior signals from another source. We conduct a study by comparing synthetic imagery that: 1) originates from a different person speaking a different utterance, 2) originates from the same person speaking a different utterance, and 3) originates from a different person speaking the same utterance. Our study shows that synthetic videos in all three cases are seen as less real and less engaging than the original source video. Our results indicate that there could be a behavioral signature, detectable from a person's movements and separate from their visual appearance, and that this behavioral signature could be used to distinguish a deep fake from a properly captured video.


Key findings
The study found that synthetic videos, regardless of whether the behavior was from a different person/utterance, same person/different utterance, or different person/same utterance, were consistently rated as less real and engaging by human participants compared to original videos. This suggests the existence of a detectable behavioral signature, distinct from visual appearance, that can be used for deepfake detection. Mouth movements, facial expressions, and head movements were identified as key contributing factors to these judgments, with their importance varying based on the type of behavioral transfer.
Approach
The authors conducted a user study in which they generated deepfake videos of a target person (Donald Trump) by transferring behavioral movements from various sources, using Wav2Lip for lip-syncing and the First Order Motion Model (FOMM) for face reenactment. Participants rated these synthetic videos against reconstructed original videos on naturalness and engagement, with different tests isolating the effects of behavior style and utterance.
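The three behavior-transfer conditions and the two-stage generation pipeline described above can be sketched as follows. This is an illustrative outline only, not the authors' code: the condition labels and the `pipeline_steps` helper are hypothetical names, and the Wav2Lip/FOMM stages are represented as descriptive strings rather than actual model calls.

```python
# Sketch of the study's three behavior-transfer conditions. In every
# synthetic clip the target's visual appearance is kept fixed, while the
# driving behavior varies in WHO is moving and WHAT is being said.
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    name: str
    same_person: bool     # behavior comes from the target speaker himself
    same_utterance: bool  # behavior source speaks the target's utterance


CONDITIONS = [
    Condition("different person / different utterance", False, False),
    Condition("same person / different utterance", True, False),
    Condition("different person / same utterance", False, True),
]


def pipeline_steps(cond: Condition) -> list[str]:
    """Hypothetical outline of the generation pipeline: lip-sync the
    target's mouth to the driving audio (Wav2Lip), then transfer head
    and facial motion from the driving clip (FOMM)."""
    return [
        f"select driving clip: {cond.name}",
        "Wav2Lip: sync target mouth to driving audio",
        "FOMM: reenact target face with driving motion",
    ]


for cond in CONDITIONS:
    print(cond.name, "->", pipeline_steps(cond)[1:])
```

Comparing each condition against the reconstructed original isolates one factor at a time: condition 2 holds the person fixed and varies the utterance, while condition 3 holds the utterance fixed and varies the person.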
Datasets
Custom-generated videos using clips from Donald Trump's interviews/debates and videos of other celebrities (Tom Cruise, Barack Obama, Taylor Swift, Emma Watson) as source material.
Model(s)
Wav2Lip, First Order Motion Model (FOMM)
Author countries
USA, Canada