Every Breath You Don't Take: Deepfake Speech Detection Using Breath

Authors: Seth Layton, Thiago De Andrade, Daniel Olszewski, Kevin Warren, Kevin Butler, Patrick Traynor

Published: 2024-04-23 15:48:51+00:00

AI Summary

This paper proposes a novel deepfake speech detection method using breath as a discriminator. A breath detector is trained to extract breath features from audio samples; these features are then used to classify speech as real or deepfake with high accuracy, outperforming a state-of-the-art model.

Abstract

Deepfake speech represents a real and growing threat to systems and society. Many detectors have been created to aid in defense against speech deepfakes. While these detectors implement myriad methodologies, many rely on low-level fragments of the speech generation process. We hypothesize that breath, a higher-level part of speech, is a key component of natural speech and thus improper generation in deepfake speech is a performant discriminator. To evaluate this, we create a breath detector and leverage this against a custom dataset of online news article audio to discriminate between real/deepfake speech. Additionally, we make this custom dataset publicly available to facilitate comparison for future work. Applying our simple breath detector as a deepfake speech discriminator on in-the-wild samples allows for accurate classification (perfect 1.0 AUPRC and 0.0 EER on test data) across 33.6 hours of audio. We compare our model with the state-of-the-art SSL-wav2vec model and show that this complex deep learning model completely fails to classify the same in-the-wild samples (0.72 AUPRC and 0.99 EER).


Key findings
The proposed breath-based detector achieves perfect accuracy (1.0 AUPRC, 0.0 EER) on the test set, significantly outperforming the state-of-the-art SSL-wav2vec model (0.72 AUPRC, 0.99 EER). Simple classifiers using breath features are effective at distinguishing real from deepfake speech.
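The paper reports results as AUPRC and EER. As a rough illustration of the latter metric, the equal error rate can be computed from a detector's scores by sweeping thresholds and finding where the false-positive and false-negative rates cross. This is a minimal stdlib sketch, not the authors' evaluation code:

```python
def eer(scores, labels):
    """Approximate equal error rate.

    scores: higher means "more likely deepfake"; labels: 1 = deepfake, 0 = real.
    Sweeps a threshold over every sample and returns the point where the
    false-negative and false-positive rates are closest to equal.
    """
    pairs = sorted(zip(scores, labels))
    pos = sum(labels)              # number of deepfake samples
    neg = len(labels) - pos        # number of real samples
    best = 1.0
    fn, tn = 0, 0
    for _, y in pairs:
        if y == 1:
            fn += 1                # a deepfake now falls below threshold
        else:
            tn += 1                # a real sample now falls below threshold
        fnr = fn / pos             # deepfakes missed
        fpr = 1 - tn / neg         # reals wrongly flagged
        best = min(best, max(fnr, fpr))
    return best
```

A perfectly separating detector (as the breath-based model on this test set) yields 0.0; a detector whose scores are inverted relative to the labels approaches 1.0, which matches the reported 0.99 EER for SSL-wav2vec on the in-the-wild data.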
Approach
The approach uses a two-stage pipeline. First, a breath detector based on convolutional and recurrent neural networks is trained on podcast data. Then, breath features (average breaths per minute, breath duration, and breath spacing) are extracted from news article audio and fed to simple classifiers (SVC, decision tree, thresholding) to discriminate between real and deepfake speech.
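The breath features named above could be computed from the detector's output roughly as follows. This is a hedged sketch: the `(start, end)` interval representation and the `min_bpm` threshold are assumptions for illustration, not details from the paper:

```python
def breath_features(breaths, audio_seconds):
    """Summarize detected breath events into the three features used for classification.

    breaths: list of (start, end) times in seconds for detected breath events
             (hypothetical format); audio_seconds: total clip length.
    """
    if not breaths:
        # No breaths detected at all: zero rate/duration, maximal spacing.
        return {"bpm": 0.0, "mean_duration": 0.0, "mean_spacing": audio_seconds}
    bpm = len(breaths) / (audio_seconds / 60.0)                  # breaths per minute
    mean_duration = sum(e - s for s, e in breaths) / len(breaths)
    gaps = [b[0] - a[1] for a, b in zip(breaths, breaths[1:])]   # silence between breaths
    mean_spacing = sum(gaps) / len(gaps) if gaps else audio_seconds
    return {"bpm": bpm, "mean_duration": mean_duration, "mean_spacing": mean_spacing}


def threshold_classifier(features, min_bpm=4.0):
    # Hypothetical thresholding rule: TTS audio tends to lack natural breaths,
    # so a very low breaths-per-minute value flags the sample as deepfake.
    return "deepfake" if features["bpm"] < min_bpm else "real"
```

The SVC and decision-tree variants mentioned in the paper would consume the same three-feature vector; the thresholding classifier shows why such simple models suffice when generated speech omits breaths almost entirely.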
Datasets
A custom dataset of online news article audio (real and text-to-speech generated), and a separate dataset of podcasts for training the breath detector.
Model(s)
Convolutional and recurrent neural networks for breath detection; Support Vector Classifier (SVC), decision tree, and thresholding for deepfake classification; Comparison against a pre-trained and fine-tuned SSL-wav2vec 2.0 model.
Author countries
USA