Towards generalizing deep-audio fake detection networks


Authors: Konstantin Gasenzer, Moritz Wolter

Published: 2023-05-22 13:37:52+00:00

AI Summary

This paper addresses the limited generalization ability of deep audio fake detectors to unseen generators by identifying stable frequency domain fingerprints of various audio generators. Using these fingerprints, the authors train lightweight, generalizing detectors that achieve improved results on the WaveFake dataset and its extended version.

Abstract

Today's generative neural networks allow the creation of high-quality synthetic speech at scale. While we welcome the creative use of this new technology, we must also recognize the risks. As synthetic speech is abused for monetary and identity theft, we require a broad set of deepfake identification tools. Furthermore, previous work reported a limited ability of deep classifiers to generalize to unseen audio generators. We study the frequency domain fingerprints of current audio generators. Building on top of the discovered frequency footprints, we train excellent lightweight detectors that generalize. We report improved results on the WaveFake dataset and an extended version. To account for the rapid progress in the field, we extend the WaveFake dataset by additionally considering samples drawn from the novel Avocodo and BigVGAN networks. For illustration purposes, the supplementary material contains audio samples of generator artifacts.

Key findings

The authors' dilated convolutional neural network (DCNN) generalizes to unseen audio generators and outperforms previous models such as LCNN, AST, and GMM on the extended WaveFake dataset. Integrated-gradients attribution analysis shows that high-frequency components are particularly significant for deepfake detection. The DCNN achieves these results with significantly fewer parameters than competing models.
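To make the attribution step concrete, here is a minimal sketch of integrated gradients on a toy model. It is not the paper's network: the logistic scorer, its weights, the four "frequency band" features, and the all-zero baseline are hypothetical choices made only for illustration. The attribution for feature i is (x_i - b_i) multiplied by the average gradient along the straight path from the baseline b to the input x, approximated here with a midpoint Riemann sum.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, weights, bias):
    """Toy detector score: logistic regression over per-band features."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def model_grad(x, weights, bias):
    """Analytic gradient of the toy score with respect to each input feature."""
    s = model(x, weights, bias)
    return [s * (1.0 - s) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    avg_grad = [0.0] * len(x)
    for step in range(steps):
        alpha = (step + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(model_grad(point, weights, bias)):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Hypothetical weights: the last (highest-frequency) band matters most.
weights = [0.1, 0.2, 0.4, 1.5]
bias = -1.0
x = [1.0, 1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, weights, bias)
# Completeness check: attributions should sum to model(x) - model(baseline).
gap = abs(
    sum(attributions)
    - (model(x, weights, bias) - model(baseline, weights, bias))
)
```

In this toy setup the largest attribution falls on the last feature because it has the largest weight. The `gap` value checks the completeness property of integrated gradients, which is a useful sanity test when applying the method to a real detector.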

Approach

The authors analyze frequency domain fingerprints (using Wavelet Packet Transform and Short-Time Fourier Transform) of several audio generators. Based on these discovered artifacts, they train a Dilated Convolutional Neural Network (DCNN) for audio deepfake detection, demonstrating improved generalization to unseen generators compared to previous methods.
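One way to picture a frequency-domain fingerprint is to average the log-magnitude spectrum of many frames and compare the result between real and generated audio. The sketch below does this with a naive short-time Fourier transform in pure Python. It is an illustration under stated assumptions, not the authors' pipeline: the frame length, hop size, Hann window, `eps` floor, and the two synthetic test signals are all hypothetical, and the wavelet packet analysis is not covered.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def dft_magnitudes(frame):
    """Magnitudes of the non-negative frequency bins of a real frame (naive DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        mags.append(abs(acc))
    return mags


def mean_log_spectrum(signal, frame_len=64, hop=32, eps=1e-6):
    """Average log-magnitude STFT over time: one value per frequency bin."""
    window = hann(frame_len)
    totals = [0.0] * (frame_len // 2 + 1)
    n_frames = 0
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        for k, m in enumerate(dft_magnitudes(frame)):
            totals[k] += math.log(m + eps)
        n_frames += 1
    if n_frames == 0:
        raise ValueError("signal shorter than one frame")
    return [t / n_frames for t in totals]


# Toy stand-ins for "real" and "generated" audio (hypothetical, not from the paper):
# the second signal carries an extra high-frequency tone at bin 24 of 64.
N = 512
real = [math.sin(2 * math.pi * 4 * t / 64) for t in range(N)]
fake = [real[t] + 0.1 * math.sin(2 * math.pi * 24 * t / 64) for t in range(N)]

# Difference of the two average spectra; a large value marks an artifact bin.
diff = [f - r for f, r in zip(mean_log_spectrum(fake), mean_log_spectrum(real))]
peak_bin = max(range(len(diff)), key=lambda k: diff[k])
```

Here `peak_bin` points at the bin where the injected tone sits. In practice a library FFT would replace the naive DFT, and the average would run over many recordings per generator rather than one synthetic signal.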

Datasets

WaveFake dataset (extended with samples from the Avocodo and BigVGAN generators), LJSpeech, JSUT

Model(s)

Dilated Convolutional Neural Network (DCNN), Light Convolutional Neural Network (LCNN), Audio Spectrogram Transformer (AST), Gaussian Mixture Model (GMM), RawNet2
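The dilated convolution that gives the DCNN its name can be sketched in a few lines. The example below is a one-dimensional, stride-1 illustration only. The kernel, the dilation rates, and the three-layer stack are hypothetical and are not taken from the paper's architecture. It shows how spacing the kernel taps apart widens the receptive field without adding parameters.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D dilated cross-correlation, as used in deep-learning conv layers."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(kernel[j] * signal[i + j * dilation] for j in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Input samples seen by one output of a stack of stride-1 dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# A ramp input and a difference kernel whose taps sit 2 samples apart.
signal = [float(i) for i in range(10)]
out = dilated_conv1d(signal, kernel=[1.0, 0.0, -1.0], dilation=2)

# Three hypothetical layers with kernel size 3 and dilations 1, 2, 4.
rf = receptive_field(kernel_size=3, dilations=[1, 2, 4])
```

With dilation 2 the three-tap kernel spans five input samples, so each output here equals `signal[i] - signal[i + 4]`. Stacking layers with growing dilation rates increases the receptive field quickly while the parameter count per layer stays fixed, which is consistent with the lightweight detectors described above.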

Author countries

Germany
