Human Perception of Audio Deepfakes

Authors: Nicolas M. Müller, Karla Pizzi, Jennifer Williams

Published: 2021-07-20 09:19:42+00:00

AI Summary

This paper compares human and machine capabilities in detecting audio deepfakes through a gamified online experiment. Humans and a state-of-the-art AI algorithm exhibited similar strengths and weaknesses, with both struggling against certain types of attacks, in contrast to AI's superhuman performance in other domains. The study also analyzes human success factors, such as native language and age.

Abstract

The recent emergence of deepfakes has brought manipulated and generated content to the forefront of machine learning research. Automatic detection of deepfakes has seen many new machine learning techniques; however, human detection capabilities are far less explored. In this paper, we present results from comparing the abilities of humans and machines for detecting audio deepfakes used to imitate someone's voice. For this, we use a web-based application framework formulated as a game. Participants were asked to distinguish between real and fake audio samples. In our experiment, 472 unique users competed against a state-of-the-art AI deepfake detection algorithm for a total of 14,912 rounds of the game. We find that humans and deepfake detection algorithms share similar strengths and weaknesses, both struggling to detect certain types of attacks. This is in contrast to the superhuman performance of AI in many application areas such as object detection or face recognition. Concerning human success factors, we find that IT professionals have no advantage over non-professionals, but native speakers have an advantage over non-native speakers. Additionally, we find that older participants tend to be more susceptible than younger ones. These insights may be helpful when designing future cybersecurity training for humans as well as developing better detection algorithms.


Key findings
Humans and the AI algorithm performed similarly in realistic scenarios, with both struggling against specific types of attacks. Native English speakers performed better than non-native speakers, while IT experience did not affect detection rates. Older participants were more susceptible to audio deepfakes than younger ones.
Approach
The researchers developed a web-based game where participants competed against a state-of-the-art AI deepfake detection algorithm (RawNet2) to identify real and fake audio samples. The study analyzed human performance and compared it to the AI's, considering factors like native language, IT experience, and age.
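The core comparison in such a game reduces to per-round accuracy over labeled audio samples, aggregated separately for human participants and the detection model. The sketch below illustrates this idea; it is not the authors' code, and all function names and sample data are hypothetical:

```python
# Hypothetical sketch of the human-vs-model comparison underlying the game:
# each round has a ground-truth label (fake or real), a human guess, and a
# model guess; accuracy is the fraction of rounds guessed correctly.

def accuracy(guesses, labels):
    """Fraction of rounds where the guess matches the true label."""
    assert len(guesses) == len(labels)
    correct = sum(g == t for g, t in zip(guesses, labels))
    return correct / len(labels)

# Invented example rounds: True = "fake", False = "real".
labels        = [True, False, True, True, False, False]
human_guesses = [True, False, False, True, False, True]
model_guesses = [True, True,  True, True, False, False]

human_acc = accuracy(human_guesses, labels)  # 4 of 6 rounds correct
model_acc = accuracy(model_guesses, labels)  # 5 of 6 rounds correct
```

In the actual study this aggregation would run over all 14,912 rounds, sliced further by participant attributes (native language, IT experience, age) to obtain the reported success factors.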
Datasets
ASVspoof Challenge 2019 dataset (train and eval splits)
Model(s)
Three-layer bidirectional LSTM and RawNet2
Author countries
Germany, United Kingdom