What Counts as Real? Speech Restoration and Voice Quality Conversion Pose New Challenges to Deepfake Detection

Authors: Shree Harsha Bokkahalli Satish, Harm Lameris, Joakim Gustafson, Éva Székely

Published: 2026-03-14 17:15:31+00:00

Comment: 5 pages, 4 figures, 3 tables. Submitted to Interspeech 2026

AI Summary

This paper demonstrates that current binary audio anti-spoofing systems misclassify benign transformations, such as voice quality conversion and speech restoration, as spoofed speech, causing high false-positive rates. The authors propose a 4-way classification framework that explicitly disentangles bona fide, processed bona fide, spoofed, and processed spoofed speech. This multi-class approach improves robustness to benign shifts and enhances spoof detection by modeling authenticity more accurately.

Abstract

Audio anti-spoofing systems are typically formulated as binary classifiers distinguishing bona fide from spoofed speech. This assumption fails under layered generative processing, where benign transformations introduce distributional shifts that are misclassified as spoofing. We show that phonation-modifying voice conversion and speech restoration are treated as out-of-distribution despite preserving speaker authenticity. Using a multi-class setup separating bona fide, converted, spoofed, and converted-spoofed speech, we analyse model behaviour through self-supervised learning (SSL) embeddings and acoustic correlates. The benign transformations induce a drift in the SSL space, compressing bona fide and spoofed speech and reducing classifier separability. Reformulating anti-spoofing as a multi-class problem improves robustness to benign shifts while preserving spoof detection, suggesting binary systems model the distribution of raw speech rather than authenticity itself.


Key findings
Binary anti-spoofing systems consistently misclassify benign processed speech as spoofed, particularly out of domain, suggesting they model the distribution of raw speech rather than authenticity. Reformulating anti-spoofing as a 4-way multi-class problem significantly improves robustness to benign shifts and enhances cross-domain spoof detection: this approach achieved 86.8% accuracy on ASVspoof5 and 94.7% bona fide accuracy, outperforming binary supervision.
Approach
The authors analyze the behavior of anti-spoofing systems under benign transformations using self-supervised learning (SSL) embeddings (HuBERT, Whisper, Wav2Vec2) and acoustic correlates. They propose and evaluate a multi-class classification setup that categorizes audio into bona fide, converted bona fide, spoofed, and converted spoofed speech, comparing its performance against traditional binary classifiers. Models are fine-tuned with an MLP classification head and a modified DF-Arena 1B architecture.
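The 4-way setup can be sketched as a small classification head over a fixed SSL embedding, with the four classes collapsing back to the binary authenticity question (processing does not change whether speech is bona fide). This is a minimal illustrative sketch, not the paper's implementation: the class names, layer sizes, and random stand-in embedding are assumptions (768 matches the HuBERT-base embedding dimension).

```python
import numpy as np

# Assumed 4-way label scheme from the paper's reformulation.
CLASSES = ["bona_fide", "processed_bona_fide", "spoofed", "processed_spoofed"]

def mlp_head(embedding, w1, b1, w2, b2):
    """One ReLU hidden layer + softmax over the four classes."""
    h = np.maximum(0.0, embedding @ w1 + b1)
    logits = h @ w2 + b2
    e = np.exp(logits - logits.max())        # numerically stable softmax
    return e / e.sum()

def to_binary(label_4way):
    """Collapse the 4-way decision to binary authenticity: benign
    processing (VQC, restoration) does not change authenticity."""
    return "bona_fide" if "bona_fide" in label_4way else "spoofed"

rng = np.random.default_rng(0)
dim, hidden = 768, 64                        # 768 = HuBERT-base embedding size
w1 = rng.standard_normal((dim, hidden)) * 0.02
b1 = np.zeros(hidden)
w2 = rng.standard_normal((hidden, len(CLASSES))) * 0.02
b2 = np.zeros(len(CLASSES))

emb = rng.standard_normal(dim)               # stand-in for an utterance embedding
probs = mlp_head(emb, w1, b1, w2, b2)
pred = CLASSES[int(np.argmax(probs))]
print(pred, "->", to_binary(pred))
```

The collapse step is why the multi-class view can improve binary robustness: a benign transformation that shifts an utterance into a "processed" class still resolves to the correct authenticity label.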
Datasets
M-AILABS corpus, MLAAD corpus, ASVspoof5 dataset. They also generate custom datasets by applying Voice Quality Conversion (VQC) and Speech Restoration (Sidon) to the M-AILABS and MLAAD corpora.
Model(s)
HuBERT-base, Whisper-small, Wav2Vec2 XLS-R 1B (for SSL embeddings and as backbone), DF-Arena 1B (pretrained anti-spoofing model), and an MLP classifier.
Author countries
Sweden