Source Tracing of Audio Deepfake Systems

Authors: Nicholas Klein, Tianxiang Chen, Hemlata Tak, Ricardo Casal, Elie Khoury

Published: 2024-07-10 19:49:10+00:00

AI Summary

This research introduces a system for classifying audio deepfake generation attributes (input type, acoustic model, vocoder) rather than simply detecting deepfakes. The system leverages existing spoofing countermeasure architectures and is evaluated on ASVspoof 2019 and MLAAD datasets, demonstrating robustness in identifying deepfake generation techniques.

Abstract

Recent progress in generative AI has made audio deepfakes remarkably more realistic. While current research on anti-spoofing systems primarily focuses on assessing whether a given audio sample is fake or genuine, limited attention has been paid to discerning the specific techniques used to create audio deepfakes. Algorithms commonly used in audio deepfake generation, such as text-to-speech (TTS) and voice conversion (VC), involve distinct stages, including input processing, acoustic modeling, and waveform generation. In this work, we introduce a system designed to classify various spoofing attributes, capturing the distinctive features of the individual modules across the entire generation pipeline. We evaluate our system on two datasets: the ASVspoof 2019 Logical Access dataset and the Multi-Language Audio Anti-Spoofing Dataset (MLAAD). Results from both experiments demonstrate the system's robustness in identifying the different spoofing attributes of deepfake generation systems.
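The prediction target described above can be made concrete with a small sketch: instead of a single bonafide/spoof label, the tracer outputs one label per generation attribute (input type, acoustic model, vocoder). The example values below are illustrative placeholders, not labels taken from the paper's datasets.

```python
# Minimal sketch of a per-attribute prediction record for source tracing.
# Field values here are hypothetical placeholders for illustration only.
from dataclasses import dataclass

@dataclass
class SpoofingAttributes:
    input_type: str      # e.g. text input (TTS) vs. speech input (VC)
    acoustic_model: str  # module mapping the input to acoustic features
    vocoder: str         # module synthesizing the final waveform

pred = SpoofingAttributes(input_type="text",
                          acoustic_model="acoustic_model_A",
                          vocoder="vocoder_B")
print(pred)
```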
Key findings
The end-to-end approach significantly outperformed a previous study on the ASVspoof 2019 dataset for acoustic model and vocoder classification, and achieved near-perfect accuracy on a new input-type classification task. On MLAAD, the end-to-end ResNet model performed strongly, with higher accuracy for vocoder classification than for acoustic model classification; acoustic models that produce similar-sounding voices were harder to distinguish.
Approach
The paper proposes two approaches: an end-to-end method that trains a standalone classifier for each attribute, and a two-stage method that adds a classification head on top of embeddings from a pre-trained spoofing detection model. Three state-of-the-art spoofing countermeasures (ResNet, self-supervised learning, and Whisper) serve as base models.
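The two-stage idea can be sketched as follows: a frozen, pre-trained model supplies embeddings, and only a small softmax classification head is trained to predict one spoofing attribute (e.g. the vocoder class). This is a hedged illustration, not the paper's code: the frozen extractor here is a stand-in random projection and the data are synthetic clusters, whereas the actual systems use ResNet, wav2vec 2.0, or Whisper embeddings of audio.

```python
# Two-stage attribute classification sketch (assumed simplification):
# stage 1 extracts embeddings from a frozen model, stage 2 trains a
# linear softmax head on those embeddings via gradient descent.
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, EMB_DIM, N_CLASSES, N_PER_CLASS = 16, 32, 3, 50

W_frozen = rng.normal(size=(FEAT_DIM, EMB_DIM))  # never updated ("frozen")

def frozen_embedding(x):
    """Stand-in for a pre-trained countermeasure's embedding extractor."""
    return x @ W_frozen

# Toy "audio features": each attribute class clusters around its own mean.
X = np.concatenate([rng.normal(loc=c, size=(N_PER_CLASS, FEAT_DIM))
                    for c in range(N_CLASSES)])
y = np.repeat(np.arange(N_CLASSES), N_PER_CLASS)

# Stage 1: extract embeddings once, then standardize them.
E = frozen_embedding(X)
E = (E - E.mean(axis=0)) / E.std(axis=0)

# Stage 2: train only a linear softmax head on the frozen embeddings.
Wh, b = np.zeros((EMB_DIM, N_CLASSES)), np.zeros(N_CLASSES)
onehot = np.eye(N_CLASSES)[y]
for _ in range(500):
    logits = E @ Wh + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = (p - onehot) / len(y)            # softmax cross-entropy gradient
    Wh -= 0.1 * (E.T @ grad)
    b -= 0.1 * grad.sum(axis=0)

acc = ((E @ Wh + b).argmax(axis=1) == y).mean()
print(f"head training accuracy: {acc:.2f}")
```

The end-to-end alternative would instead train the full network, including the embedding extractor, directly on the attribute labels rather than reusing frozen spoofing-detection embeddings.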
Datasets
ASVspoof 2019 Logical Access dataset and Multi-Language Audio Anti-Spoofing Dataset (MLAAD)
Model(s)
ResNet, self-supervised learning (SSL) with wav2vec 2.0, Whisper
Author countries
USA