PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection

Authors: Oguzhan Baser, Ahmet Ege Tanriverdi, Sriram Vishwanath, Sandeep P. Chinchali

Published: 2025-06-28 06:56:41+00:00

AI Summary

This paper introduces PhonemeFake (PF), a new deepfake attack that manipulates crucial speech segments using language reasoning, making deepfakes more realistic and harder to detect. A novel bilevel detection model, PhonemeFakeDetect (PFD), is also presented, significantly improving detection accuracy and efficiency by focusing computation on manipulated regions.

Abstract

Deepfake (DF) attacks pose a growing threat as generative models become increasingly advanced. However, our study reveals that existing DF datasets fail to deceive human perception, unlike real DF attacks that influence public discourse. It highlights the need for more realistic DF attack vectors. We introduce PhonemeFake (PF), a DF attack that manipulates critical speech segments using language reasoning, significantly reducing human perception by up to 42% and benchmark accuracies by up to 94%. We release an easy-to-use PF dataset on HuggingFace and open-source bilevel DF segment detection model that adaptively prioritizes compute on manipulated regions. Our extensive experiments across three known DF datasets reveal that our detection model reduces EER by 91% while achieving up to 90% speed-up, with minimal compute overhead and precise localization beyond existing models as a scalable solution.

Key findings

The PhonemeFake dataset reduces the benchmark accuracy of state-of-the-art deepfake detection models by up to 94%. The proposed PhonemeFakeDetect model reduces Equal Error Rate (EER) by up to 91% and runs up to 90% faster than existing models, which indicates that it is effective at detecting subtle, segmental manipulations.
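For readers unfamiliar with the metric, EER is the operating point at which the false-acceptance rate equals the false-rejection rate. The following is a minimal sketch of how EER and a relative EER reduction could be computed; the function names, the threshold sweep, and the toy scores are illustrative assumptions, not the authors' evaluation code.

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumption: a higher score means "more likely bona fide".
    """
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Toy scores, made up for illustration only.
bona_fide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
toy_eer = equal_error_rate(bona_fide, spoof)
```

Under this definition, a 91% relative reduction means the new EER is 9% of the baseline EER; for example, a hypothetical baseline of 0.20 would drop to 0.018.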

Approach

The authors propose a two-level detection model. A low-frequency (LF) stream pre-processes the audio to identify regions of interest (ROIs). A high-frequency (HF) stream then performs a detailed analysis of these ROIs using a more computationally intensive model. A gating mechanism dynamically activates the HF stream only when needed, balancing accuracy and efficiency.
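The gating idea can be illustrated with a small sketch. This is a toy stand-in under stated assumptions, not the PFD architecture: the real model uses learned LSTM streams, whereas here `lf_score` and `hf_score` are hypothetical hand-written scorers, and `window`, `gate_threshold`, and `hf_threshold` are made-up parameters.

```python
def bilevel_detect(frames, window, lf_score, hf_score,
                   gate_threshold, hf_threshold):
    """Run a cheap scorer on every window and an expensive one only on ROIs.

    Returns the indices of flagged frames and the number of windows on
    which the expensive scorer was actually invoked.
    """
    flagged = []
    hf_calls = 0
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        # Gate: skip the expensive stream when the cheap score is low.
        if lf_score(chunk) < gate_threshold:
            continue
        hf_calls += 1
        # Expensive stream: one score per frame inside the ROI.
        for offset, score in enumerate(hf_score(chunk)):
            if score >= hf_threshold:
                flagged.append(start + offset)
    return flagged, hf_calls


def toy_lf_score(chunk):
    # Hypothetical cheap pass: look at every other frame only.
    return max(chunk[::2])


def toy_hf_score(chunk):
    # Hypothetical detailed pass: return a score for every frame.
    return list(chunk)


# Sixteen toy per-frame "manipulation" values with one suspicious region.
frames = [0.1] * 8 + [0.1, 0.9, 0.8, 0.1] + [0.1] * 4
flagged, hf_calls = bilevel_detect(
    frames, window=4, lf_score=toy_lf_score, hf_score=toy_hf_score,
    gate_threshold=0.5, hf_threshold=0.5,
)
```

In this toy run the expensive scorer is invoked on one of four windows, which is the source of the compute saving: cost scales with the amount of suspicious audio rather than with total duration. The trade-off is that a region missed by the cheap pass is never examined in detail.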

Datasets

WaveFake (WF), In-the-Wild Audio DF (ITW), ASVspoof21 (ASV), SpoofCeleb (SC), Half-truth (HAD), and the newly created PhonemeFake (PF) dataset.

Model(s)

PhonemeFakeDetect (PFD), a bilevel LSTM architecture with a dynamic gating mechanism. For comparison, SASV2, SCL, AASIST, RawBMamba (RM), and a fine-tuned MMA model were also used.

Author countries

USA
