WaveFake: A Data Set to Facilitate Audio Deepfake Detection

Authors: Joel Frank, Lea Schönherr

Published: 2021-11-04 12:26:34+00:00

AI Summary

This paper introduces WaveFake, a novel dataset for audio deepfake detection, comprising nine sample sets generated by five state-of-the-art network architectures and spanning two languages. It also provides two baseline models (GMM and RawNet2) to facilitate future research in this area.

Abstract

Deep generative modeling has the potential to cause significant harm to society. Recognizing this threat, a multitude of research into detecting so-called Deepfakes has emerged. This research most often focuses on the image domain, while studies exploring generated audio signals have, so far, been neglected. In this paper we make three key contributions to narrow this gap. First, we provide researchers with an introduction to common signal processing techniques used for analyzing audio signals. Second, we present a novel data set, for which we collected nine sample sets from five different network architectures, spanning two languages. Finally, we supply practitioners with two baseline models, adopted from the signal processing community, to facilitate further research in this area.


Key findings
GMM classifiers proved more robust than RawNet2 models across different test conditions. Subtle differences were found between audio generated by the different architectures, particularly in the higher frequencies. The best-performing models exhibited a trade-off between in-distribution and out-of-distribution performance.
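To make the high-frequency observation concrete, the following is a minimal sketch (not taken from the paper) that compares the time-averaged log-magnitude spectra of a real and a generated utterance. The file names, sample rate, and the 8 kHz cutoff are illustrative assumptions.

```python
# Minimal sketch: compare average magnitude spectra of a real and a
# generated utterance to inspect high-frequency differences.
# File paths and the 22.05 kHz sample rate are assumptions for illustration.
import numpy as np
import librosa

def mean_log_spectrum(path, sr=22050, n_fft=1024, hop_length=256):
    """Load an audio file and return its time-averaged log-magnitude spectrum."""
    audio, _ = librosa.load(path, sr=sr)
    spec = np.abs(librosa.stft(audio, n_fft=n_fft, hop_length=hop_length))
    return np.log(spec.mean(axis=1) + 1e-9)  # average over frames

real = mean_log_spectrum("real_sample.wav")    # hypothetical file
fake = mean_log_spectrum("melgan_sample.wav")  # hypothetical file

# Frequency of each STFT bin; differences tend to show up in the upper bins.
freqs = librosa.fft_frequencies(sr=22050, n_fft=1024)
high = freqs > 8000
print("mean log-spectral gap above 8 kHz:", (real[high] - fake[high]).mean())
```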
Approach
The authors created a dataset of audio samples generated by various generative network architectures. They then used Gaussian Mixture Models (GMMs) and RawNet2, a hybrid CNN-GRU model that operates on raw waveforms, as baseline classifiers, evaluating how well real audio can be distinguished from generated audio across different datasets and settings.
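As a rough illustration of the GMM baseline described above (not the authors' implementation), the sketch below trains one Gaussian mixture on cepstral features of real speech and one on generated speech, then scores a test file by log-likelihood ratio. MFCCs are used here as a stand-in for the cepstral features used in the paper; the file lists, sample rate, and mixture size are assumptions.

```python
# Minimal sketch of a two-GMM real-vs-fake classifier on cepstral features.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def cepstral_features(paths, sr=16000, n_mfcc=20):
    """Stack per-frame cepstral coefficients from a list of audio files."""
    frames = []
    for p in paths:
        audio, _ = librosa.load(p, sr=sr)
        mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
        frames.append(mfcc.T)  # shape (frames, n_mfcc)
    return np.concatenate(frames, axis=0)

# Hypothetical training file lists (real speech vs. vocoder-generated speech).
real_feats = cepstral_features(["real_0.wav", "real_1.wav"])
fake_feats = cepstral_features(["fake_0.wav", "fake_1.wav"])

# Illustrative mixture size; larger mixtures are common in practice.
gmm_real = GaussianMixture(n_components=64, covariance_type="diag").fit(real_feats)
gmm_fake = GaussianMixture(n_components=64, covariance_type="diag").fit(fake_feats)

def score(path):
    """Log-likelihood ratio: positive values favour 'real', negative 'fake'."""
    feats = cepstral_features([path])
    return gmm_real.score(feats) - gmm_fake.score(feats)

print(score("unknown.wav"))  # hypothetical test file
```

RawNet2, by contrast, consumes the raw waveform directly through a convolutional front end followed by a GRU, so it requires no such hand-crafted features.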
Datasets
LJSpeech, JSUT, Common Voice
Model(s)
Gaussian Mixture Model (GMM), RawNet2
Author countries
Germany