Phoneme-Level Analysis for Person-of-Interest Speech Deepfake Detection

Authors: Davide Salvi, Viola Negroni, Sara Mandelli, Paolo Bestagini, Stefano Tubaro

Published: 2025-07-11 14:27:57+00:00

AI Summary

This paper proposes a phoneme-level, Person-of-Interest (POI) based method for speech deepfake detection. It builds a speaker profile from the individual phonemes of reference audio and compares the phonemes of test audio against that profile to detect synthetic artifacts, achieving accuracy comparable to traditional methods with improved robustness and interpretability.

Abstract

Recent advances in generative AI have made the creation of speech deepfakes widely accessible, posing serious challenges to digital trust. To counter this, various speech deepfake detection strategies have been proposed, including Person-of-Interest (POI) approaches, which focus on identifying impersonations of specific individuals by modeling and analyzing their unique vocal traits. Despite their excellent performance, the existing methods offer limited granularity and lack interpretability. In this work, we propose a POI-based speech deepfake detection method that operates at the phoneme level. Our approach decomposes reference audio into phonemes to construct a detailed speaker profile. At inference time, phonemes from a test sample are individually compared against this profile, enabling fine-grained detection of synthetic artifacts. The proposed method achieves accuracy comparable to traditional approaches while offering superior robustness and interpretability, key aspects in multimedia forensics. By focusing on phoneme analysis, this work explores a novel direction for explainable, speaker-centric deepfake detection.


Key findings
The phoneme-level approach achieves accuracy comparable to a traditional POI method while processing significantly less audio data. It is more robust to post-processing operations such as added noise and compression. Phoneme-level analysis also improves interpretability by identifying which specific phonemes contribute to a detection.
Approach
The method decomposes audio into phonemes, extracts features for each phoneme, and constructs a speaker profile from reference audio. Each phoneme of the test audio is compared to this profile using cosine distance, and the distances are aggregated to classify the audio as real or fake.
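The comparison step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that per-phoneme embeddings have already been extracted, and the mean-embedding profile, the mean aggregation of distances, the 0.5 threshold, and all function names (`build_profile`, `cosine_distance`, `score_utterance`, `classify`) are hypothetical choices made only for illustration.

```python
import numpy as np


def build_profile(reference):
    """Average the embeddings of each phoneme seen in the reference audio.

    `reference` is an iterable of (phoneme_label, embedding) pairs.
    Using the mean as the profile entry is an assumption of this sketch.
    """
    grouped = {}
    for label, emb in reference:
        grouped.setdefault(label, []).append(np.asarray(emb, dtype=float))
    return {label: np.mean(embs, axis=0) for label, embs in grouped.items()}


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector has zero norm."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 1.0
    return 1.0 - float(np.dot(a, b) / denom)


def score_utterance(test, profile):
    """Compare each test phoneme to the profile entry for the same phoneme.

    Returns the mean distance over all compared phonemes (assumed
    aggregation) and a per-phoneme breakdown, which is what makes the
    decision inspectable. Phonemes missing from the profile are skipped.
    """
    per_phoneme = {}
    for label, emb in test:
        if label in profile:
            dist = cosine_distance(emb, profile[label])
            per_phoneme.setdefault(label, []).append(dist)
    if not per_phoneme:
        raise ValueError("no test phoneme is present in the speaker profile")
    all_distances = [d for ds in per_phoneme.values() for d in ds]
    breakdown = {label: float(np.mean(ds)) for label, ds in per_phoneme.items()}
    return float(np.mean(all_distances)), breakdown


def classify(score, threshold=0.5):
    """Label the utterance; the threshold value is a placeholder."""
    return "fake" if score > threshold else "real"


# Toy 2-D embeddings standing in for real phoneme features.
reference = [("AA", [1.0, 0.0]), ("AA", [1.0, 0.0]), ("IY", [0.0, 1.0])]
profile = build_profile(reference)

real_score, real_breakdown = score_utterance(
    [("AA", [2.0, 0.0]), ("IY", [0.0, 3.0]), ("ZZ", [1.0, 1.0])], profile
)
fake_score, fake_breakdown = score_utterance(
    [("AA", [0.0, 1.0]), ("IY", [1.0, 0.0])], profile
)
print(classify(real_score), real_breakdown)
print(classify(fake_score), fake_breakdown)
```

The per-phoneme breakdown returned by `score_utterance` mirrors the interpretability argument: a reviewer can see which phonemes lie farthest from the speaker profile instead of receiving a single opaque score. In practice the threshold would need to be calibrated on held-out data.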
Datasets
ASVspoof 2019, In-the-Wild, Purdue speech dataset, TIMIT-TTS, LJSpeech, LibriSpeech, LibriVox, VCTK
Model(s)
A fine-tuned Wav2Vec 2.0 model for phoneme extraction and the base Wav2Vec 2.0 model for feature extraction.
Author countries
Italy