Every Breath You Don't Take: Deepfake Speech Detection Using Breath

Authors: Seth Layton, Thiago De Andrade, Daniel Olszewski, Kevin Warren, Kevin Butler, Patrick Traynor

Published: 2024-04-23 15:48:51+00:00

Comment: Submitted to ACM journal -- Digital Threats: Research and Practice

AI Summary

This paper proposes using breath as a high-level feature for deepfake speech detection, hypothesizing that current synthetic speech lacks natural breathing patterns. The authors develop a breath detector and use breath-related statistics computed over a custom dataset of in-the-wild online news audio to discriminate between real and deepfake speech. Their simple breath-based detector achieves perfect classification (1.0 AUPRC and 0.0 EER) on test data, outperforming the state-of-the-art SSL-wav2vec model.

Abstract

Deepfake speech represents a real and growing threat to systems and society. Many detectors have been created to aid in defense against speech deepfakes. While these detectors implement myriad methodologies, many rely on low-level fragments of the speech generation process. We hypothesize that breath, a higher-level component of natural speech, is a key part of speaking, and that its improper generation in deepfake speech is therefore a performant discriminator. To evaluate this, we create a breath detector and apply it to a custom dataset of online news-article audio to discriminate between real and deepfake speech. Additionally, we make this custom dataset publicly available to facilitate comparison in future work. Applying our simple breath detector as a deepfake speech discriminator on in-the-wild samples allows for accurate classification (perfect 1.0 AUPRC and 0.0 EER on test data) across 33.6 hours of audio. We compare our model with the state-of-the-art SSL-wav2vec model and show that this complex deep learning model completely fails to classify the same in-the-wild samples (0.72 AUPRC and 0.99 EER).


Key findings
Breaths are automatically detectable, and the detector generalizes across speakers (RQ1). Current in-the-wild deepfake speech does not sufficiently incorporate breaths (RQ2). The breath-based deepfake detector achieves perfect classification (1.0 AUPRC and 0.0 EER) on the test dataset of in-the-wild news articles (RQ3). In contrast, the pretrained state-of-the-art SSL-wav2vec2.0 model fails badly on the same samples, achieving 0.72 AUPRC and 0.99 EER.
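For reference, both reported metrics can be computed directly from per-sample detector scores. The sketch below is a minimal, illustrative implementation using scikit-learn; the helper name and toy data are assumptions, not artifacts from the paper.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_curve

def auprc_and_eer(labels, scores):
    """Compute AUPRC and EER for binary detection scores.

    labels: 1 for the positive class (deepfake), 0 for real.
    scores: higher values mean "more likely deepfake".
    """
    auprc = average_precision_score(labels, scores)
    # EER is the operating point where the false-positive rate equals
    # the false-negative rate (1 - true-positive rate) on the ROC curve.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    i = np.nanargmin(np.abs(fpr - fnr))
    eer = (fpr[i] + fnr[i]) / 2
    return auprc, eer

# Toy check: perfectly separated scores give AUPRC 1.0 and EER 0.0,
# matching the paper's reported best case.
labels = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.10, 0.20, 0.30, 0.80, 0.90, 0.95])
print(auprc_and_eer(labels, scores))  # (1.0, 0.0)
```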
Approach
The approach is a two-stage pipeline. First, a breath detector locates breaths in audio using a multi-tiered convolutional and recurrent neural network (CNN-LSTM) that processes mel-spectrogram, zero-crossing rate (ZCR), and root-mean-square energy (RMSE) features. Second, three aggregate breath statistics (average breaths per minute, average breath duration, and average spacing between breaths) are computed from the detected breaths and fed to simple classifiers (C-Support Vector Classification, a decision tree, or thresholding) to distinguish real from deepfake speech, as sketched below.
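As a concrete illustration of the second stage, the following sketch computes the three aggregate statistics from detected breath intervals and fits an SVC on them. The (start, end) interval format, helper name, and toy values are assumptions; the paper does not publish this exact interface.

```python
import numpy as np
from sklearn.svm import SVC

def aggregate_breath_features(breaths, audio_seconds):
    """Three aggregate statistics from detected breath intervals.

    breaths: list of (start_s, end_s) tuples from the breath detector.
    Returns [breaths per minute, mean breath duration (s), mean gap (s)].
    """
    if not breaths:
        # No detected breaths: zero rate and duration; the "gap" spans the clip.
        return np.array([0.0, 0.0, audio_seconds])
    bpm = len(breaths) / (audio_seconds / 60.0)
    durations = [end - start for start, end in breaths]
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(breaths, breaths[1:])]
    mean_gap = float(np.mean(gaps)) if gaps else audio_seconds
    return np.array([bpm, float(np.mean(durations)), mean_gap])

# Toy data: a human-read clip with breaths vs. a TTS clip with none.
human = aggregate_breath_features([(4.8, 5.3), (11.9, 12.4), (19.0, 19.6)], 60.0)
tts = aggregate_breath_features([], 60.0)
X, y = np.stack([human, tts]), np.array([0, 1])  # 0 = real, 1 = deepfake
clf = SVC().fit(X, y)
print(clf.predict(X))  # [0 1] on the training toys
```

Because in-the-wild TTS audio rarely contains detectable breaths, the two classes separate cleanly on these statistics, which is consistent with even simple thresholding achieving the perfect test scores reported above.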
Datasets
A custom-curated podcast set (~5 hours) for training the breath detector, and a custom dataset of online news-article audio (277 TTS-generated and 56 human-read articles, totaling 52.48 hours) for training and testing the deepfake detector.
Model(s)
For breath detection: a multi-tiered convolutional and recurrent neural network (CNN-LSTM). For deepfake detection: C-Support Vector Classification (SVC), a three-tiered decision tree, and a simple thresholding classifier. For comparison: SSL-wav2vec2.0 (specifically, an XLS-R-based deepfake detector).
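The paper characterizes the breath detector only as a multi-tiered CNN-LSTM over frame-level features, so the PyTorch sketch below is an illustrative architecture rather than the authors' exact model. The per-frame feature layout (128 mel bins plus ZCR and RMSE, 130 values per frame) and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class BreathDetector(nn.Module):
    """Minimal CNN-LSTM sketch for frame-level breath detection.

    Input: (batch, time, n_features), where each frame is assumed to
    concatenate mel-spectrogram bins with ZCR and RMSE values.
    """
    def __init__(self, n_features=130, hidden=64):
        super().__init__()
        # 1-D convolutions over time capture local spectral patterns.
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # A bidirectional LSTM models longer-range temporal context.
        self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        # Per-frame breath / no-breath score.
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):  # x: (batch, time, n_features)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, 64)
        h, _ = self.lstm(h)                               # (batch, time, 2*hidden)
        return torch.sigmoid(self.head(h)).squeeze(-1)    # (batch, time)

# Smoke test on random input: 2 clips, 200 frames, 130 features each.
scores = BreathDetector()(torch.randn(2, 200, 130))
print(scores.shape)  # torch.Size([2, 200])
```

Frames whose score exceeds a threshold would then be merged into (start, end) breath intervals and passed to the stage-two statistics sketched under Approach.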
Author countries
USA