Gender Fairness in Audio Deepfake Detection: Performance and Disparity Analysis

Authors: Aishwarya Fursule, Shruti Kshirsagar, Anderson R. Avila

Published: 2026-03-09 22:52:12+00:00

Comment: 6 pages, 3 Figures

AI Summary

This paper conducts a thorough analysis of gender fairness in audio deepfake detection models, an area previously underexplored. The study uses the ASVspoof 5 dataset, training a ResNet-18 classifier with various audio features and comparing it against the baseline AASIST model. It incorporates five established fairness metrics alongside conventional Equal Error Rate (EER) to quantify and understand gender-dependent performance disparities.

Abstract

Audio deepfake detection aims to distinguish real human voices from those generated by Artificial Intelligence (AI) and has emerged as a significant problem in the field of voice biometrics. With the ever-improving quality of synthetic voices, the probability of such a voice being exploited for illicit practices like identity theft and impersonation increases. Although significant progress has been made in audio deepfake detection in recent times, the issue of gender bias remains underexplored and in its nascent stage. In this paper, we present a thorough analysis of gender-dependent performance and fairness in audio deepfake detection models. Using the ASVspoof 5 dataset, we train a ResNet-18 classifier, evaluate detection performance across four different audio features, and compare the results with the baseline AASIST model. Beyond conventional metrics such as Equal Error Rate (EER %), we incorporate five established fairness metrics to quantify gender disparities in the model. Our results show that even when the overall EER difference between genders appears low, fairness-aware evaluation reveals disparities in error distribution that are obscured by aggregate performance measures. These findings demonstrate that reliance on standard metrics alone is unreliable, whereas fairness metrics provide critical insights into demographic-specific failure modes. This work highlights the importance of fairness-aware evaluation for developing more equitable, robust, and trustworthy audio deepfake detection systems.
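
For context, EER is the operating point where the false acceptance rate equals the false rejection rate. The sketch below shows one common way to compute it from detection scores; the function name and the higher-score-means-bonafide convention are illustrative assumptions, not the paper's scoring pipeline.

```python
# Minimal sketch of Equal Error Rate (EER) computation from detection
# scores, assuming higher scores indicate "bonafide". Illustrative only;
# the paper's exact scoring pipeline is not specified here.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Return the EER (%) at the point where FPR ~= FNR.
    labels: 1 = bonafide, 0 = spoof."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # operating point closest to FPR == FNR
    return 100.0 * (fpr[idx] + fnr[idx]) / 2.0

# Hypothetical per-gender usage, mirroring the paper's group-wise analysis
# (labels, scores, and genders would come from the evaluation set):
# eer_f = compute_eer(labels[genders == "F"], scores[genders == "F"])
# eer_m = compute_eer(labels[genders == "M"], scores[genders == "M"])
```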


Key findings
The study revealed that while overall EER differences between genders may appear low, fairness-aware evaluation uncovers statistically significant disparities in error distribution. Performance and the direction of bias vary systematically with feature choice, demonstrating that reliance on standard metrics like EER is insufficient to capture demographic bias. The findings emphasize the necessity of incorporating comprehensive fairness metrics for developing equitable and robust audio deepfake detection systems.
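
To make the masking effect concrete, the toy example below uses invented confusion counts (not from the paper) to show how two groups can share the same aggregate error rate while having opposite error profiles.

```python
# Invented per-gender confusion counts at a single global threshold,
# purely to illustrate how equal aggregate error can hide opposite
# error profiles. These numbers are NOT results from the paper.
counts = {
    "F": {"fp": 90, "n_spoof": 1000, "fn": 10, "n_bonafide": 1000},
    "M": {"fp": 10, "n_spoof": 1000, "fn": 90, "n_bonafide": 1000},
}
for g, c in counts.items():
    overall = (c["fp"] + c["fn"]) / (c["n_spoof"] + c["n_bonafide"])
    print(g,
          "overall error:", overall,                 # 5% for both groups
          "FPR:", c["fp"] / c["n_spoof"],            # 9% vs 1%
          "FNR:", c["fn"] / c["n_bonafide"])         # 1% vs 9%
# Both groups misclassify 5% of trials overall, yet spoofed "F" voices
# are accepted far more often while genuine "M" voices are rejected far
# more often -- exactly the asymmetry aggregate metrics obscure.
```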
Approach
The authors analyze gender-dependent performance and fairness in audio deepfake detection by training a ResNet-18 classifier on the ASVspoof 5 dataset, leveraging four different audio features (Log-Spectrogram, CQT, WavLM, Wav2Vec 2.0 embeddings). They compare its performance with the baseline AASIST model and evaluate both using conventional EER and five established fairness metrics (Statistical Parity, Equal Opportunity, Equality of Odds, Predictive Parity, Treatment Equality) to quantify gender disparities.
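
As a rough illustration of how such group-conditional metrics can be computed from binary predictions, consider the sketch below. The function names, the gap formulations (signed differences, and a max-gap for Equality of Odds), and the gender labels are assumptions for exposition; the paper's exact definitions may differ.

```python
# Illustrative sketch of the five fairness metrics as gender gaps,
# computed from binary predictions (1 = bonafide, 0 = spoof).
# Variable names and formulations are assumptions, not the paper's
# reference implementation. Assumes each group contains both classes.
import numpy as np

def group_rates(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return {
        "pos_rate": (tp + fp) / len(y_true),   # P(pred = 1)
        "tpr": tp / (tp + fn),                 # true positive rate
        "fpr": fp / (fp + tn),                 # false positive rate
        "ppv": tp / (tp + fp),                 # precision
        "fn_fp": fn / fp if fp else np.inf,    # FN/FP ratio
    }

def fairness_gaps(y_true, y_pred, groups):
    f = group_rates(y_true[groups == "F"], y_pred[groups == "F"])
    m = group_rates(y_true[groups == "M"], y_pred[groups == "M"])
    return {
        "statistical_parity": f["pos_rate"] - m["pos_rate"],
        "equal_opportunity": f["tpr"] - m["tpr"],
        # Equality of Odds requires both TPR and FPR to match across groups
        "equality_of_odds": max(abs(f["tpr"] - m["tpr"]),
                                abs(f["fpr"] - m["fpr"])),
        "predictive_parity": f["ppv"] - m["ppv"],
        "treatment_equality": f["fn_fp"] - m["fn_fp"],
    }
```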
Datasets
ASVspoof 5 dataset
Model(s)
ResNet-18, AASIST
Author countries
USA, Canada