Reliable Audio Deepfake Detection in Variable Conditions via Quantum-Kernel SVMs

Authors: Lisan Al Amin, Vandana P. Janeja

Published: 2025-12-21 16:31:05+00:00

Comment: This paper is accepted in ICDM 2025-MLC workshop

AI Summary

This paper introduces the use of quantum-kernel Support Vector Machines (QSVMs) for robust audio deepfake detection in conditions with scarce labeled data and varying recording environments. The authors demonstrate that QSVMs significantly reduce false-positive rates and equal-error rates (EER) compared to classical SVMs, leveraging quantum feature maps to achieve superior class separability without increasing model size. The approach provides consistent performance gains across diverse datasets, making it a viable drop-in alternative for practical deepfake detection pipelines.

Abstract

Detecting synthetic speech is challenging when labeled data are scarce and recording conditions vary. Existing end-to-end deep models often overfit or fail to generalize, and while kernel methods can remain competitive, their performance heavily depends on the chosen kernel. Here, we show that using a quantum kernel in audio deepfake detection reduces false-positive rates without increasing model size. Quantum feature maps embed data into high-dimensional Hilbert spaces, enabling the use of expressive similarity measures and compact classifiers. Building on this motivation, we compare quantum-kernel SVMs (QSVMs) with classical SVMs using identical mel-spectrogram preprocessing and stratified 5-fold cross-validation across four corpora (ASVspoof 2019 LA, ASVspoof 5 (2024), ADD23, and an In-the-Wild set). QSVMs achieve consistently lower equal-error rates (EER): 0.183 vs. 0.299 on ASVspoof 5 (2024), 0.081 vs. 0.188 on ADD23, 0.346 vs. 0.399 on ASVspoof 2019, and 0.355 vs. 0.413 In-the-Wild. At the EER operating point (where FPR equals FNR), these correspond to absolute false-positive-rate reductions of 0.116 (38.8%), 0.107 (56.9%), 0.053 (13.3%), and 0.058 (14.0%), respectively. We also report how consistent the results are across cross-validation folds and margin-based measures of class separation, using identical settings for both models. The only modification is the kernel; the features and SVM remain unchanged, no additional trainable parameters are introduced, and the quantum kernel is computed on a conventional computer.
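The EER operating point used throughout the abstract is the decision threshold at which the false-positive rate equals the false-negative rate. As a minimal sketch (the function name and the synthetic scores are illustrative, not from the paper), the EER can be estimated from detector scores with scikit-learn's ROC utilities:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """Return the EER: the ROC operating point where FPR == FNR.

    labels: 1 for spoofed (positive class), 0 for bona fide.
    scores: higher values indicate "more likely spoofed".
    """
    fpr, tpr, thresholds = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # Index where the gap between FPR and FNR is smallest.
    idx = np.nanargmin(np.abs(fpr - fnr))
    # At that point FPR and FNR are nearly equal; average for stability.
    return (fpr[idx] + fnr[idx]) / 2.0, thresholds[idx]

# Illustrative usage with synthetic scores (not the paper's data):
rng = np.random.default_rng(0)
labels = np.concatenate([np.zeros(500), np.ones(500)])
scores = np.concatenate([rng.normal(0.0, 1.0, 500),
                         rng.normal(1.5, 1.0, 500)])
eer, threshold = equal_error_rate(labels, scores)
print(f"EER = {eer:.3f} at threshold {threshold:.3f}")
```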


Key findings
QSVMs consistently achieved lower equal-error rates (EER) across all evaluated datasets, including reductions from 0.299 to 0.183 on ASVspoof 5 (2024) and from 0.188 to 0.081 on ADD23. At the EER operating point, these improvements correspond to absolute false-positive-rate reductions of 0.053 to 0.116, or relative reductions of 13.3% to 56.9%. The quantum kernels enhanced feature separability and produced more stable error rates across cross-validation folds without adding trainable parameters to the model.
Approach
The authors transform raw audio into mel-spectrogram features, which are then min-max scaled and reduced using fold-specific PCA. They employ a controlled "kernel-swap" experimental pipeline where these identical features are fed to a classical SVM solver, using either a standard classical kernel or a quantum kernel computed via a parameterized quantum feature map in classical simulation. This methodology isolates the performance differences to the kernel type, ensuring a fair comparison.
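To make the kernel-swap idea concrete, below is a minimal single-fold sketch in Python. It is not the authors' code: the quantum feature map (a Qiskit ZZFeatureMap fidelity kernel simulated classically), the RBF classical baseline, the PCA dimension, and the time-averaged mel-spectrogram pooling are all illustrative assumptions; only the overall structure (identical features, identical SVM solver, kernel swapped) follows the paper's description. It requires the qiskit and qiskit-machine-learning packages.

```python
import librosa
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from qiskit.circuit.library import ZZFeatureMap
from qiskit_machine_learning.kernels import FidelityQuantumKernel

N_COMPONENTS = 8  # illustrative PCA dimension = number of qubits

def mel_features(path, sr=16000, n_mels=64):
    """Load audio and return a fixed-length log-mel feature vector."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    return log_mel.mean(axis=1)  # time-averaged; a simple pooling choice

def fit_fold(X_train, y_train, X_test, kernel="quantum"):
    """Fit one CV fold; only the kernel differs between the two arms."""
    # Fold-specific scaling and PCA, fit on training data only.
    scaler = MinMaxScaler().fit(X_train)
    pca = PCA(n_components=N_COMPONENTS).fit(scaler.transform(X_train))
    Z_train = pca.transform(scaler.transform(X_train))
    Z_test = pca.transform(scaler.transform(X_test))

    if kernel == "quantum":
        # Fidelity quantum kernel, evaluated by classical simulation.
        fmap = ZZFeatureMap(feature_dimension=N_COMPONENTS, reps=2)
        qk = FidelityQuantumKernel(feature_map=fmap)
        K_train = qk.evaluate(x_vec=Z_train)
        K_test = qk.evaluate(x_vec=Z_test, y_vec=Z_train)
        clf = SVC(kernel="precomputed").fit(K_train, y_train)
        return clf.decision_function(K_test)
    # Classical arm: same features, same solver, a standard RBF kernel.
    clf = SVC(kernel="rbf").fit(Z_train, y_train)
    return clf.decision_function(Z_test)
```

Scikit-learn's precomputed-kernel interface is what makes the swap clean: both arms use the same SVC solver, and only the Gram matrix passed to it changes.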
Datasets
ASVspoof 2019 LA, ASVspoof 5 (2024), In-the-Wild, ADD23
Model(s)
Quantum Support Vector Machine (QSVM), Support Vector Machine (SVM) with mel-spectrogram features
Author countries
USA