Pitch Imperfect: Detecting Audio Deepfakes Through Acoustic Prosodic Analysis


Authors: Kevin Warren, Daniel Olszewski, Seth Layton, Kevin Butler, Carrie Gates, Patrick Traynor

Published: 2025-02-20 16:52:55+00:00

AI Summary

This paper proposes an audio deepfake detection method based on six classical prosodic features, including pitch, jitter, shimmer, and harmonics-to-noise ratio (HNR). The model achieves 93% accuracy and a 24.7% equal error rate (EER), comparable to existing baselines, while demonstrating greater robustness against adversarial attacks and providing explainability through attention mechanisms.

Abstract

Audio deepfakes are increasingly in-differentiable from organic speech, often fooling both authentication systems and human listeners. While many techniques use low-level audio features or optimization black-box model training, focusing on the features that humans use to recognize speech will likely be a more long-term robust approach to detection. We explore the use of prosody, or the high-level linguistic features of human speech (e.g., pitch, intonation, jitter) as a more foundational means of detecting audio deepfakes. We develop a detector based on six classical prosodic features and demonstrate that our model performs as well as other baseline models used by the community to detect audio deepfakes with an accuracy of 93% and an EER of 24.7%. More importantly, we demonstrate the benefits of using a linguistic features-based approach over existing models by applying an adaptive adversary using an $L_{\infty}$ norm attack against the detectors and using attention mechanisms in our training for explainability. We show that we can explain the prosodic features that have highest impact on the model's decision (Jitter, Shimmer and Mean Fundamental Frequency) and that other models are extremely susceptible to simple $L_{\infty}$ norm attacks (99.3% relative degradation in accuracy). While overall performance may be similar, we illustrate the robustness and explainability benefits to a prosody feature approach to audio deepfake detection.

Key findings

The prosodic-feature-based model achieved comparable performance to existing baselines in detecting audio deepfakes. The model showed greater robustness against L∞ norm attacks compared to baseline models. Attention mechanisms successfully identified key prosodic features (jitter, shimmer, mean F0) influencing classification decisions.
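
To make the idea of an L∞ norm attack concrete, the sketch below applies one step of the fast gradient sign method to a toy logistic-regression detector. This is an illustration only: the summary does not specify the paper's adaptive attack, and the function names, weights, inputs, and epsilon here are all hypothetical. The point it shows is that an L∞-bounded perturbation changes each input coordinate by at most epsilon.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability that x is a deepfake under a toy logistic model (assumed, not the paper's)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def sign(v):
    return (v > 0) - (v < 0)

def linf_step(weights, bias, x, label, epsilon):
    """One signed-gradient step that raises the log loss.

    Every coordinate moves by at most epsilon, so the perturbation
    stays inside an L-infinity ball of radius epsilon around x.
    """
    p = predict(weights, bias, x)
    # Gradient of binary cross-entropy w.r.t. the input of a logistic model.
    grad = [(p - label) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Hypothetical detector and input; label 1 means "deepfake".
weights, bias = [2.0, -1.0], 0.0
x = [1.0, 0.5]
x_adv = linf_step(weights, bias, x, label=1, epsilon=0.5)

print(predict(weights, bias, x))      # score before the perturbation
print(predict(weights, bias, x_adv))  # score after: lower, so the deepfake looks more "organic"
```

A real attack on an audio detector would compute the gradient through the full model and usually iterate; the single-step linear case is used here only because the gradient can be written out by hand.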

Approach

The researchers developed a deepfake detector using six prosodic features extracted from audio samples. An LSTM network was trained on these features, and attention mechanisms were incorporated to enhance explainability. Adversarial testing was conducted to assess robustness.
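
As a rough illustration of what classical prosodic features look like numerically, the sketch below computes mean fundamental frequency, local jitter, and local shimmer from per-cycle measurements. It assumes the common "local" definitions (mean absolute difference between consecutive cycles, divided by the mean); the summary does not say which definitions or extraction tool the authors used, and the function names and sample values are hypothetical.

```python
from statistics import mean

def local_perturbation(values):
    """Mean absolute difference between consecutive values, relative to their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)

def prosodic_summary(periods_s, peak_amplitudes):
    """Three illustrative features from glottal-cycle periods (seconds) and peak amplitudes."""
    return {
        # Mean of the per-cycle frequencies 1/T.
        "mean_f0_hz": mean(1.0 / p for p in periods_s),
        # Jitter: cycle-to-cycle variation in period length.
        "jitter_local": local_perturbation(periods_s),
        # Shimmer: cycle-to-cycle variation in amplitude.
        "shimmer_local": local_perturbation(peak_amplitudes),
    }

# Hypothetical measurements for four consecutive cycles of a voiced segment.
periods = [0.0100, 0.0102, 0.0099, 0.0101]
amps = [0.80, 0.78, 0.82, 0.79]
features = prosodic_summary(periods, amps)
print(features)
```

In practice the per-cycle periods and amplitudes would come from a pitch tracker run on the audio, and a sequence of such feature vectors over time would be what a recurrent model consumes.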

Datasets

ASVspoof2021 dataset (deepfake track)

Model(s)

LSTM network with attention mechanisms
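
One way attention can support explainability is that its weights form a probability distribution over the items being combined, so larger weights mark the items that contributed most. The sketch below shows plain dot-product attention pooling over a list of vectors. It is a generic illustration under assumed shapes, not the paper's architecture: the hidden states, the query vector, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Dot-product attention: score each state against the query, then average by weight."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(query)
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return context, weights

# Hypothetical recurrent-layer outputs (three steps, two dimensions) and a learned query.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
context, weights = attention_pool(hidden_states, query)
print(weights)  # the first and third states receive equal, larger weights than the second
```

Inspecting `weights` after a prediction is the kind of readout that lets an analyst see which inputs the model emphasized.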

Author countries

UNKNOWN
