Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes

Authors: Kuiyuan Zhang, Zhongyun Hua, Rushi Lan, Yushu Zhang, Yifang Guo

Published: 2024-12-17 07:31:19+00:00

AI Summary

This paper proposes a novel speech deepfake detection method that leverages inconsistencies in phoneme-level speech features. It introduces adaptive phoneme pooling to extract these features and a graph attention network to model their temporal dependencies, achieving superior performance over state-of-the-art methods on multiple datasets.

Abstract

Recent advancements in text-to-speech and speech conversion technologies have enabled the creation of highly convincing synthetic speech. While these innovations offer numerous practical benefits, they also pose significant security challenges when maliciously misused. Therefore, there is an urgent need to detect these synthetic speech signals. Phoneme features provide a powerful speech representation for deepfake detection. However, previous phoneme-based detection approaches typically focused on specific phonemes, overlooking temporal inconsistencies across the entire phoneme sequence. In this paper, we develop a new mechanism for detecting speech deepfakes by identifying inconsistencies in phoneme-level speech features. We design an adaptive phoneme pooling technique that extracts sample-specific phoneme-level features from frame-level speech data. By applying this technique to features extracted by pre-trained audio models on previously unseen deepfake datasets, we demonstrate that deepfake samples often exhibit phoneme-level inconsistencies when compared to genuine speech. To further enhance detection accuracy, we propose a deepfake detector that uses a graph attention network to model the temporal dependencies of phoneme-level features. Additionally, we introduce a random phoneme substitution augmentation technique to increase feature diversity during training. Extensive experiments on four benchmark datasets demonstrate the superior performance of our method over existing state-of-the-art detection methods.
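To make the pooling step concrete, below is a minimal PyTorch sketch of phoneme-level pooling, assuming that a frame-level phoneme recognizer supplies a per-frame phoneme label and that each maximal run of identical labels is mean-pooled into one phoneme-level vector; the function name `adaptive_phoneme_pool` and the mean-pooling choice are illustrative, not the paper's exact formulation.

```python
# Minimal sketch of phoneme-level pooling over frame-level features.
# Assumption: per-frame phoneme labels come from a pre-trained phoneme
# recognizer (e.g., a frame-wise argmax over its outputs); each maximal
# run of identical labels is one phoneme segment and is mean-pooled.
import torch

def adaptive_phoneme_pool(frames: torch.Tensor, labels: torch.Tensor):
    """frames: (T, D) frame-level features; labels: (T,) phoneme ids.
    Returns (P, D) phoneme-level features and (P,) segment labels."""
    # Mark positions where the phoneme label changes to find segment starts.
    change = torch.ones_like(labels, dtype=torch.bool)
    change[1:] = labels[1:] != labels[:-1]
    seg_id = torch.cumsum(change, dim=0) - 1        # (T,) segment index per frame
    num_segs = int(seg_id[-1]) + 1

    # Mean-pool all frames belonging to the same segment.
    pooled = torch.zeros(num_segs, frames.size(1))
    pooled.index_add_(0, seg_id, frames)
    counts = torch.bincount(seg_id, minlength=num_segs).clamp(min=1)
    pooled = pooled / counts.unsqueeze(1).to(pooled.dtype)

    seg_labels = labels[change]                     # label of each segment
    return pooled, seg_labels

# Example: 10 frames of 8-dim features with a toy label sequence.
feats = torch.randn(10, 8)
labs = torch.tensor([3, 3, 3, 7, 7, 1, 1, 1, 1, 5])
phoneme_feats, phoneme_ids = adaptive_phoneme_pool(feats, labs)
print(phoneme_feats.shape, phoneme_ids)             # torch.Size([4, 8]) tensor([3, 7, 1, 5])
```

Because the segment boundaries come from the sample's own predicted labels, the pooled sequence length varies from sample to sample, which is what makes the resulting phoneme-level features sample-specific.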


Key findings
The proposed method significantly outperforms state-of-the-art deepfake detection methods on multiple datasets, demonstrating robustness to noise and compression artifacts. The use of phoneme-level features and their temporal dependencies proves crucial for improved detection accuracy.
Approach
Frame-level features from a pre-trained audio backbone are aggregated into sample-specific phoneme-level features via adaptive phoneme pooling, guided by a pre-trained phoneme recognition model. A graph attention network (sketched below) then models the temporal dependencies of these phoneme-level features for deepfake classification, and random phoneme substitution augmentation enhances feature diversity during training.
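The summary does not spell out the detector's graph construction, so the following is a hedged sketch of a single-head graph attention layer over a simple temporal chain graph (each phoneme node linked to itself and its immediate neighbors), in the spirit of the GAT of Velickovic et al.; `ChainGATLayer` is a hypothetical name, and the actual detector presumably stacks several such layers behind a classification head.

```python
# Minimal single-head graph attention layer over a temporal chain graph.
# Assumption: each phoneme node attends to itself and its immediate
# temporal neighbors; the paper's exact graph construction is not given
# in this summary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChainGATLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)  # a^T [Wh_i || Wh_j]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (P, in_dim) phoneme-level features, ordered in time."""
        p = x.size(0)
        h = self.proj(x)                                   # (P, out_dim)

        # Pairwise attention logits e_ij = LeakyReLU(a^T [h_i || h_j]).
        hi = h.unsqueeze(1).expand(p, p, -1)
        hj = h.unsqueeze(0).expand(p, p, -1)
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1), 0.2)

        # Chain adjacency: self-loops plus immediate temporal neighbors.
        idx = torch.arange(p)
        adj = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= 1
        e = e.masked_fill(~adj, float("-inf"))

        alpha = torch.softmax(e, dim=-1)                   # normalize over neighbors
        return F.elu(alpha @ h)                            # (P, out_dim)

# Example: 4 phoneme-level vectors from the pooling step.
layer = ChainGATLayer(8, 16)
out = layer(torch.randn(4, 8))
print(out.shape)                                           # torch.Size([4, 16])
```

Random phoneme substitution is likewise only named in this summary, so the sketch below encodes one plausible reading: with some probability, a phoneme-level feature vector is replaced by a vector drawn from a pool of phoneme features gathered from other training samples; the function name, the substitution probability, and the uniform sampling scheme are all assumptions.

```python
# Hypothetical sketch of random phoneme substitution augmentation: with
# probability p, replace a phoneme-level vector with one drawn from a
# pool of phoneme features taken from other training samples. The
# paper's exact scheme is not reproduced in this summary.
import torch

def random_phoneme_substitution(phonemes: torch.Tensor,
                                pool: torch.Tensor,
                                p: float = 0.1) -> torch.Tensor:
    """phonemes: (P, D) one sample's phoneme-level features.
    pool: (N, D) phoneme-level features gathered from other samples."""
    out = phonemes.clone()
    mask = torch.rand(phonemes.size(0)) < p            # which phonemes to swap
    if mask.any():
        picks = torch.randint(0, pool.size(0), (int(mask.sum()),))
        out[mask] = pool[picks]                        # substitute selected rows
    return out

augmented = random_phoneme_substitution(torch.randn(4, 8), torch.randn(100, 8))
```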
Datasets
ASVspoof2019, ASVspoof2021, MLAAD, In-the-Wild, MUSAN (for robustness testing), Common Voice 6.1 (for phoneme recognition model pre-training)
Model(s)
Wav2Vec2, WavLM (as backbones for feature extraction), Graph Attention Network (GAT), pre-trained phoneme recognition model (trained on Common Voice 6.1)
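On the backbone side, frame-level features can be pulled from a pre-trained Wav2Vec2 model via Hugging Face `transformers`, as in the sketch below; the `facebook/wav2vec2-base` checkpoint is only an example, since the summary does not state which checkpoints the paper uses or whether they are fine-tuned.

```python
# Sketch of frame-level feature extraction with a pre-trained Wav2Vec2
# backbone via Hugging Face transformers. The checkpoint name is an
# example, not necessarily the one used in the paper.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

waveform = torch.randn(16000)                   # 1 s of 16 kHz audio (placeholder)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = model(**inputs).last_hidden_state  # (1, T, 768) frame-level features
print(frames.shape)
```

These frame-level features would feed the pooling step above, with the phoneme recognition model (pre-trained on Common Voice 6.1) supplying the per-frame phoneme labels.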
Author countries
China