Creation and Detection of German Voice Deepfakes

Authors: Vanessa Barnekow, Dominik Binder, Niclas Kromrey, Pascal Munaretto, Andreas Schaad, Felix Schmieder

Published: 2021-08-02 06:17:25+00:00

AI Summary

This paper investigates how feasible it is to create and detect German voice deepfakes using readily available tools and datasets. The authors show that convincing deepfakes can be generated with relatively little effort and that, on average, only 37% of user-study participants could tell the real voice from the fake one. A detection approach based on bispectral analysis performs better.

Abstract

Synthesizing voice with the help of machine learning techniques has made rapid progress over the last years [1] and first high-profile fraud cases have been recently reported [2]. Given the current increase in using conferencing tools for online teaching, we question just how easy (i.e. needed data, hardware, skill set) it would be to create a convincing voice fake. We analyse how much training data a participant (e.g. a student) would actually need to fake another participant's voice (e.g. a professor). We provide an analysis of the existing state of the art in creating voice deep fakes, as well as offer detailed technical guidance and evidence of just how much effort is needed to copy a voice. A user study with more than 100 participants shows how difficult it is to identify real and fake voice (on avg. only 37 percent can distinguish between real and fake voice of a professor). With a focus on German language and an online teaching environment we discuss the societal implications as well as demonstrate how to use machine learning techniques to possibly detect such fakes.

Key findings

Convincing German voice deepfakes are easily created with limited data (3 hours) and resources. Human detection accuracy is low: on average, only 37% of the more than 100 user-study participants could distinguish the real voice from the fake one. A bispectral analysis method shows promise for automated deepfake detection, achieving up to 80% precision.

Approach

The authors used the Tacotron 2 model for voice synthesis, training it on datasets of the German Chancellor Angela Merkel and a university professor. They evaluated the generated deepfakes through a user study and explored a bispectral analysis-based approach for detection.
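
To make the detection idea more concrete, here is a minimal sketch of a bispectrum estimate, the quantity that bispectral analysis is built on. This summary does not say how the authors compute or classify their bispectral features, so everything below is a generic illustration rather than their method: the function name `bispectrum`, the parameters `nfft` and `hop`, and the synthetic test tone are assumptions. The estimate averages the triple product X(f1) · X(f2) · conj(X(f1 + f2)) over windowed segments, which highlights frequency pairs whose phases are coupled.

```python
import numpy as np

def bispectrum(signal, nfft=128, hop=64):
    """Segment-averaged (direct) bispectrum estimate.

    Returns a complex (nfft//2, nfft//2) array whose entry [i, j]
    estimates E[X(i) * X(j) * conj(X(i + j))] over FFT bins i, j.
    Illustrative only; nfft and hop are assumed values.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(nfft)
    half = nfft // 2
    bins = np.arange(half)
    # sum_bins[i, j] = i + j; the largest value is nfft - 2, a valid FFT index.
    sum_bins = bins[:, None] + bins[None, :]

    acc = np.zeros((half, half), dtype=complex)
    count = 0
    for start in range(0, len(signal) - nfft + 1, hop):
        seg = signal[start:start + nfft]
        seg = (seg - seg.mean()) * window  # remove DC offset, then taper
        X = np.fft.fft(seg)
        acc += X[bins][:, None] * X[bins][None, :] * np.conj(X[sum_bins])
        count += 1
    if count == 0:
        raise ValueError("signal is shorter than one segment")
    return acc / count

# Synthetic example (not speech): three tones at bins 10, 25 and 35,
# where the third tone's phase is the sum of the first two (phase coupling).
n = np.arange(4096)
x = (np.cos(2 * np.pi * 10 * n / 128)
     + np.cos(2 * np.pi * 25 * n / 128 + 0.7)
     + np.cos(2 * np.pi * 35 * n / 128 + 0.7))

B = bispectrum(x)
peak = np.unravel_index(np.argmax(np.abs(B)), B.shape)
print("strongest bispectral peak at bins:", peak)
```

In a detection setting, summary statistics of the magnitude and phase of such an estimate could serve as input features to a classifier trained on real and synthesized recordings. That final step is left out here because the summary gives no details on it.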

Datasets

The German M-AILABS dataset (Angela Merkel's speeches) and a custom dataset of a university professor's voice recordings from online lectures.

Model(s)

Tacotron 2 (with WaveGlow vocoder), pretrained models from NVIDIA.

Author countries

Germany
