Physics-Guided Deepfake Detection for Voice Authentication Systems

Authors: Alireza Mohammadi, Keshav Sood, Dhananjay Thiruvady, Asef Nazari

Published: 2025-12-04 23:37:18+00:00

AI Summary

This paper introduces a framework designed for voice authentication systems at the network edge, addressing the dual threats of deepfake synthesis attacks and control-plane poisoning in federated learning. The approach integrates interpretable physics-guided features, modeling vocal tract dynamics, with representations from a self-supervised learning module. These are processed through a Multi-Modal Ensemble Architecture and a Bayesian ensemble to provide uncertainty estimates, enhancing robustness against advanced deepfake attacks and sophisticated control-plane poisoning.

Abstract

Voice authentication systems deployed at the network edge face dual threats: a) sophisticated deepfake synthesis attacks and b) control-plane poisoning in distributed federated learning protocols. We present a framework coupling physics-guided deepfake detection with uncertainty-aware edge learning. The framework fuses interpretable physics features modeling vocal tract dynamics with representations from a self-supervised learning module. The fused representations are then processed by a Multi-Modal Ensemble Architecture, followed by a Bayesian ensemble that provides uncertainty estimates. Incorporating physics-based evaluations and uncertainty estimates of audio samples allows the proposed framework to remain robust to both advanced deepfake attacks and sophisticated control-plane poisoning, addressing the complete threat model for networked voice authentication.


Key findings
The framework achieved equal error rates (EERs) of 6.80% on ASVspoof 2019 LA and 9.05% on ASVspoof 2021 LA, demonstrating effective cross-dataset generalization. Physics-informed features were empirically shown to provide significant discriminative power against deepfake audio samples. The Bayesian uncertainty quantification enabled robust trust assessment: deepfakes exhibited a 14% higher average uncertainty than genuine samples, which helps screen malicious updates in federated learning.
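For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoofed audio accepted) equals the false rejection rate (genuine audio rejected). The sketch below shows one simple way to estimate it from detector scores. The function name, the score convention (higher means more likely genuine), and the toy scores are assumptions made for illustration; they are not taken from the paper or from the ASVspoof evaluation tooling.

```python
def equal_error_rate(scores, labels):
    """Estimate (eer, threshold) from detector scores.

    Assumed convention: higher score = more likely genuine.
    labels: 1 for genuine (bona fide), 0 for spoof.
    """
    genuine = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    if not genuine or not spoof:
        raise ValueError("need at least one genuine and one spoof score")

    best = None
    # Use every observed score as a candidate decision threshold.
    for t in sorted(set(scores)):
        # False acceptance: spoof samples scored at or above the threshold.
        far = sum(s >= t for s in spoof) / len(spoof)
        # False rejection: genuine samples scored below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of FAR and FRR where they are closest.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores invented for illustration only.
scores = [0.9, 0.8, 0.7, 0.35, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
eer, threshold = equal_error_rate(scores, labels)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On real evaluation sets the threshold sweep is usually done on sorted arrays for speed, but the definition is the same.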
Approach
The framework fuses physics-based features with self-supervised learning (SSL) embeddings, ensuring their orthogonality through QR decomposition. This combined representation is then fed into a Hybrid Detection Backbone comprising a Vision Transformer, a Graph Neural Network, and Gradient Boosting algorithms. Finally, a Bayesian ensemble using MC Dropout quantifies prediction uncertainty, aiding both deepfake classification and trust-based aggregation in edge learning.
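The orthogonalization step can be illustrated with a small sketch. The summary does not say exactly how QR decomposition is applied, so the following is only one plausible reading: compute the Q factor of the physics feature matrix, then subtract from each SSL embedding dimension its projection onto that basis, so the two feature groups carry non-overlapping linear information across a batch. All function names and toy values are hypothetical, and the Q factor is obtained here with modified Gram-Schmidt in pure Python; a library routine such as `numpy.linalg.qr` would normally be used instead.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def qr_q_factor(columns, tol=1e-12):
    """Orthonormal basis for span(columns) via modified Gram-Schmidt.

    For linearly independent columns this is the Q factor of a thin QR
    decomposition; (near-)dependent columns are dropped.
    """
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            coeff = dot(q, v)
            v = [vi - coeff * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(dot(v, v))
        if norm > tol:
            basis.append([vi / norm for vi in v])
    return basis


def orthogonalize_against(ssl_columns, physics_columns):
    """Remove from each SSL column its projection onto span(physics_columns)."""
    q_basis = qr_q_factor(physics_columns)
    residuals = []
    for col in ssl_columns:
        v = list(col)
        for q in q_basis:
            coeff = dot(q, v)
            v = [vi - coeff * qi for vi, qi in zip(v, q)]
        residuals.append(v)
    return residuals


# Each inner list is one feature dimension observed over a batch of 4 samples.
# Values are invented for illustration only.
physics = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
ssl = [[2.0, 1.0, 0.0, 3.0], [1.0, 1.0, 1.0, 1.0]]
ssl_orth = orthogonalize_against(ssl, physics)
```

After this step every column in `ssl_orth` has zero inner product with every physics column. The second toy SSL column lies entirely in the span of the physics features, so its residual is zero, which shows how redundant information is removed before fusion.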
Datasets
ASVspoof 2019 LA and PA protocols, ASVspoof 2021 LA and PA sets
Model(s)
microsoft/wavlm-large encoder, Vision Transformer (ViT), Graph Neural Network (GNN), Gradient Boosting (LightGBM), MC Dropout Sampling inference framework
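The MC Dropout sampling listed above can be sketched in a few lines: dropout stays active at inference, the same input is passed through the model several times, and the spread of the outputs serves as an uncertainty estimate. The example below is a minimal pure-Python illustration on a single logistic unit. The weights, dropout rate, number of passes, and the choice of variance and predictive entropy as uncertainty measures are all assumptions for illustration, not the paper's configuration.

```python
import math
import random


def stochastic_forward(x, weights, bias, drop_p, rng):
    """One forward pass of a logistic unit with dropout left on at inference."""
    keep = 1.0 - drop_p
    z = bias
    for xi, wi in zip(x, weights):
        if rng.random() < keep:
            z += wi * xi / keep  # inverted dropout scaling
    return 1.0 / (1.0 + math.exp(-z))


def mc_dropout_predict(x, weights, bias, drop_p=0.3, passes=200, seed=0):
    """Return (mean probability, variance, predictive entropy) over passes."""
    rng = random.Random(seed)
    probs = [stochastic_forward(x, weights, bias, drop_p, rng)
             for _ in range(passes)]
    mean = sum(probs) / passes
    var = sum((p - mean) ** 2 for p in probs) / passes
    eps = 1e-12  # guards against log(0)
    entropy = -(mean * math.log(mean + eps)
                + (1.0 - mean) * math.log(1.0 - mean + eps))
    return mean, var, entropy


# Hypothetical feature vector and weights, for illustration only.
x = [0.2, 0.4, 1.0]
weights = [1.5, -2.0, 0.5]
mean, var, entropy = mc_dropout_predict(x, weights, bias=0.1)
print(f"p(spoof) ~ {mean:.3f}, variance {var:.4f}, entropy {entropy:.3f}")
```

In a federated setting, a per-sample or per-client uncertainty of this kind could be used to down-weight updates that look unreliable, which is the role the summary attributes to the Bayesian ensemble.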
Author countries
Australia