Investigating the Impact of Speech Enhancement on Audio Deepfake Detection in Noisy Environments

Authors: Anacin, Angela, Shruti Kshirsagar, Anderson R. Avila

Published: 2026-03-16 03:02:35+00:00

AI Summary

This paper investigates the correlation between speech quality and the performance of audio spoofing detection systems (LA task) in noisy environments. It evaluates two speech enhancement algorithms, SEGAN and MetricGAN+, on their impact on a state-of-the-art audio deepfake detection system. The study reveals that the enhancement algorithm leading to lower perceptual speech quality scores (SEGAN) surprisingly provides better deepfake detection performance (lower EER), suggesting a complex interplay between speech enhancement and anti-spoofing.

Abstract

Logical Access (LA) attacks, also known as audio deepfake attacks, use Text-to-Speech (TTS) or Voice Conversion (VC) methods to generate spoofed speech data. This can represent a serious threat to Automatic Speaker Verification (ASV) systems, as intruders can use such attacks to bypass voice biometric security. In this study, we investigate the correlation between speech quality and the performance of audio spoofing detection systems (i.e., LA task). To that end, the performance of two enhancement algorithms is evaluated based on two perceptual speech quality measures, namely Perceptual Evaluation of Speech Quality (PESQ) and Speech-to-Reverberation Modulation Ratio (SRMR), and with respect to their impact on the audio spoofing detection system. We adopted the LA dataset, provided in the ASVspoof 2019 Challenge, and corrupted its test set with noise at different Signal-to-Noise Ratio (SNR) levels, while leaving the training data untouched. Enhancement was applied to attenuate the detrimental effects of noisy speech, and the performances of two models, Speech Enhancement Generative Adversarial Network (SEGAN) and Metric-Optimized Generative Adversarial Network Plus (MetricGAN+), were compared. Although we expect that speech quality will correlate well with speech applications' performance, enhancement can also have side effects on downstream tasks if unwanted artifacts are introduced or relevant information is removed from the speech signal. Our results corroborate this hypothesis, as we found that the enhancement algorithm leading to the highest speech quality scores, MetricGAN+, provided the highest Equal Error Rate (EER) on the audio spoofing detection task, whereas the enhancement method with the lowest speech quality scores, SEGAN, led to the lowest EER, and thus to better performance on the LA task.
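
Since the abstract reports results as EER, a minimal sketch of how an EER can be computed from detector scores may help. It is not the authors' evaluation code. The function name `compute_eer`, the convention that a higher score means "more likely bona fide", and the synthetic scores are all assumptions made for illustration.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the Equal Error Rate from two score arrays.

    Assumption: a higher score means the detector believes the
    utterance is bona fide. Returns (eer, threshold).
    """
    bonafide_scores = np.asarray(bonafide_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)

    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bonafide_scores)),
                             np.zeros(len(spoof_scores))])

    # Sort by score; each sorted score is tried as a threshold, and
    # everything at or below the threshold is rejected as spoofed.
    order = np.argsort(scores, kind="mergesort")
    scores, labels = scores[order], labels[order]

    # False rejection rate: bona fide trials at or below the threshold.
    frr = np.cumsum(labels) / len(bonafide_scores)
    # False acceptance rate: spoofed trials above the threshold.
    far = 1.0 - np.cumsum(1.0 - labels) / len(spoof_scores)

    # The EER is taken where the two error rates are closest.
    # Tied scores are handled only approximately in this sketch.
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0, scores[idx]

# Synthetic example scores (not data from the paper).
rng = np.random.default_rng(0)
bonafide = rng.normal(loc=2.0, scale=1.0, size=1000)
spoof = rng.normal(loc=0.0, scale=1.0, size=1000)
eer, threshold = compute_eer(bonafide, spoof)
print(f"EER = {100 * eer:.2f}% at threshold {threshold:.3f}")
```

A lower EER means the bona fide and spoofed score distributions overlap less, which is the sense in which the abstract compares the two enhancement methods.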


Key findings
SEGAN, despite yielding lower perceptual speech quality scores (PESQ, SRMR), significantly improved audio deepfake detection performance (lower EER) in noisy conditions compared to MetricGAN+ or unenhanced noisy speech. MetricGAN+, which achieved higher speech quality scores, resulted in higher EERs, suggesting it may remove crucial cues necessary for spoof detection. This highlights that enhancement optimized for perceptual quality might adversely affect downstream deepfake detection tasks.
Approach
The authors corrupt the test set of the ASVspoof 2019 LA dataset with Babble and Cafeteria noise at different SNR levels. They then apply two speech enhancement algorithms, SEGAN and MetricGAN+, to the noisy data. The enhanced audio and the unenhanced noisy audio are then fed into the AASIST audio deepfake detection model, whose performance is assessed using EER and t-DCF and correlated with the speech quality measures (PESQ, SRMR).
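
The noise-corruption step can be sketched as follows. This is a generic way to mix noise into speech at a target SNR, not the authors' exact procedure. The helper name `mix_at_snr`, the noise looping strategy, and the synthetic signals are assumptions for illustration.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so the mixture has the requested SNR in dB.

    Hypothetical helper: the noise is looped or trimmed to the speech
    length, then scaled so that
    10 * log10(speech_power / noise_power) == snr_db.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)

    # Match the noise length to the speech length.
    if len(noise) < len(speech):
        repeats = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, repeats)
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Scale factor that brings the noise power to
    # speech_power / 10^(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Synthetic stand-ins for a speech utterance and a noise recording.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
speech = np.sin(2.0 * np.pi * 220.0 * t)
rng = np.random.default_rng(0)
noise = rng.standard_normal(sample_rate // 2)

noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

In a pipeline like the one described, each noisy file would then be passed through an enhancement model before scoring; that step is omitted here because it depends on the specific SEGAN or MetricGAN+ implementation used.
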
Datasets
ASVspoof 2019 Logical Access (LA) dataset (test set corrupted with Babble and Cafeteria noise at various SNR levels)
Model(s)
Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks (AASIST) for detection; Speech Enhancement Generative Adversarial Network (SEGAN) and Metric-Optimized Generative Adversarial Network Plus (MetricGAN+) for speech enhancement.
Author countries
USA, Canada