Vulnerabilities of Audio-Based Biometric Authentication Systems Against Deepfake Speech Synthesis

Authors: Mengze Hong, Di Jiang, Zeying Xie, Weiwei Zhao, Guan Wang, Chen Jason Zhang

Published: 2026-01-06 10:55:32+00:00

AI Summary

This paper empirically evaluates state-of-the-art speaker authentication systems against modern audio deepfake synthesis. It reveals two critical security vulnerabilities: commercial speaker verification systems are easily bypassed by voice cloning models trained on minimal data, and anti-spoofing detectors fail to generalize robustly to unseen deepfake generation methods. The findings highlight an urgent need for architectural innovations and adaptive multi-factor authentication strategies.

Abstract

As audio deepfakes transition from research artifacts to widely available commercial tools, robust biometric authentication faces pressing security threats in high-stakes industries. This paper presents a systematic empirical evaluation of state-of-the-art speaker authentication systems based on a large-scale speech synthesis dataset, revealing two major security vulnerabilities: 1) modern voice cloning models trained on very small amounts of data can easily bypass commercial speaker verification systems; and 2) anti-spoofing detectors struggle to generalize across different methods of audio synthesis, leading to a significant gap between in-domain performance and real-world robustness. These findings call for a reconsideration of security measures and stress the need for architectural innovations, adaptive defenses, and the transition towards multi-factor authentication.


Key findings
State-of-the-art speaker verification systems are highly vulnerable: deepfake speech from voice cloning models trained on only minutes of target audio bypassed them at rates up to 82.7%. Deepfake detection models, despite achieving an EER of 0.83% in-domain, degraded roughly 30-fold (EER 24.84%) when tested on unseen synthesis models. Detection performance also dropped significantly under environmental noise, underscoring poor real-world robustness and generalization.
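For concreteness, here is a minimal sketch of how the equal error rate (EER) cited above is typically computed from detector scores: the error rate at the threshold where the false acceptance and false rejection rates cross. The scores, labels, and distributions below are illustrative placeholders, not the paper's data.

```python
# Minimal EER sketch. labels: 1 = bonafide speech, 0 = spoof/deepfake.
# Scores are the detector's "bonafide-ness" outputs; all data here is toy data.
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Return the equal error rate: the point where the false acceptance
    rate (spoof scored at/above threshold) equals the false rejection
    rate (bonafide scored below threshold)."""
    order = np.argsort(scores)                 # sweep thresholds low -> high
    scores, labels = scores[order], labels[order]
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    frr = np.cumsum(labels) / n_pos            # bonafide rejected so far
    far = 1.0 - np.cumsum(1 - labels) / n_neg  # spoof still accepted
    idx = np.argmin(np.abs(far - frr))         # approximate crossing point
    return float((far[idx] + frr[idx]) / 2)

# Toy usage: well-separated score distributions give a low EER; overlapping
# ones (as with unseen synthesis models) push it up.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(2.0, 1.0, 1000),    # bonafide
                         rng.normal(-2.0, 1.0, 1000)])  # deepfake
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
print(f"toy EER: {compute_eer(scores, labels):.2%}")
```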
Approach
The authors systematically evaluate state-of-the-art speaker verification and deepfake detection systems by creating a large-scale benchmark using diverse open-source voice cloning models. They assess the robustness and generalization of these systems across various attack conditions, including in-domain, out-of-domain (unseen synthesis models), cross-lingual, and noisy environments.
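To make the scoring concrete, the sketch below computes the bypass-rate metric implied by this evaluation: the fraction of cloned utterances whose verifier similarity score clears the accept threshold. The function, scores, and threshold here are hypothetical illustrations; the paper does not publish this code.

```python
# Hypothetical bypass-rate scoring for a spoofing attack on a verifier.
import numpy as np

def bypass_rate(clone_scores: np.ndarray, threshold: float) -> float:
    """Fraction of deepfake trials accepted by the verifier, where
    clone_scores are similarity scores (e.g. cosine similarity between
    speaker embeddings) of cloned speech against the target's enrollment."""
    return float((clone_scores >= threshold).mean())

# Example with made-up scores and a threshold tuned on genuine trials.
scores = np.array([0.71, 0.55, 0.83, 0.62, 0.48, 0.77])
print(f"bypass rate: {bypass_rate(scores, threshold=0.60):.1%}")  # 66.7%
```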
Datasets
VoxCeleb, AISHELL-3, ASVspoof 2021 LA, ASVspoof 2021 DF, and synthetic speech generated by GPT-SoVITS, Bert-VITS2, RVC, AdaSpeech4, BinauralGrad, MPBert, NaturalSpeech 1, NaturalSpeech 2, NaturalSpeech 3, PromptTTS 1, PromptTTS 2.
Model(s)
ECAPA-TDNN (for speaker verification), XLS-R + AASIST (for deepfake detection). Baseline comparisons included LFCC + GMM, ResNet34, RawNet2, and AASIST (standalone).
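The paper does not state which toolkit its ECAPA-TDNN evaluation used; as one plausible way to reproduce a single verification trial, the sketch below uses the publicly available SpeechBrain pretrained checkpoint. The import path, checkpoint name, and audio file names are assumptions, not details from the paper.

```python
# Illustrative verification trial with a pretrained ECAPA-TDNN via SpeechBrain
# (pip install speechbrain). In SpeechBrain < 1.0 the import lives under
# speechbrain.pretrained instead of speechbrain.inference.speaker.
from speechbrain.inference.speaker import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# verify_files embeds both utterances with ECAPA-TDNN, compares them by
# cosine similarity, and thresholds the score into an accept/reject decision.
score, decision = verifier.verify_files("enrollment.wav", "cloned_probe.wav")
print(f"similarity={score.item():.3f}  accepted={bool(decision)}")
```

In the paper's attack setting, "cloned_probe.wav" would be synthesized speech mimicking the enrolled speaker; an accept decision on such a pair constitutes a bypass.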
Author countries
Hong Kong, China