Unsupervised Domain Adaptation for Audio Deepfake Detection with Modular Statistical Transformations

Authors: Urawee Thani, Gagandeep Singh, Priyanka Singh

Published: 2026-03-09 04:00:38+00:00

Comment: 9 pages, 4 figures

AI Summary

This paper addresses cross-domain generalization in audio deepfake detection with a modular pipeline for unsupervised domain adaptation. Pre-trained Wav2Vec 2.0 embeddings pass through a series of statistical transformations (power transformation, ANOVA-based feature selection, joint PCA, and CORAL alignment) and are then classified with logistic regression. The complete pipeline improves cross-domain accuracy by 10.7% over the baseline, and ablations show how much each interpretable, modular component contributes.

Abstract

Audio deepfake detection systems trained on one dataset often fail when deployed on data from different sources due to distributional shifts in recording conditions, synthesis methods, and acoustic environments. We present a modular pipeline for unsupervised domain adaptation that combines pre-trained Wav2Vec 2.0 embeddings with statistical transformations to improve cross-domain generalization without requiring labeled target data. Our approach applies power transformation for feature normalization, ANOVA-based feature selection, joint PCA for domain-agnostic dimensionality reduction, and CORAL alignment to match source and target covariance structures before classification via logistic regression. We evaluate on two cross-domain transfer scenarios: ASVspoof 2019 LA to Fake-or-Real (FoR) and FoR to ASVspoof, achieving 62.7–63.6% accuracy with balanced performance across real and fake classes. Systematic ablation experiments reveal that feature selection (+3.5%) and CORAL alignment (+3.2%) provide the largest individual contributions, with the complete pipeline improving accuracy by 10.7% over baseline. While performance is modest compared to within-domain detection (94–96%), our pipeline offers transparency and modularity, making it suitable for deployment scenarios requiring interpretable decisions.

Key findings

The proposed pipeline achieved 62.7–63.6% accuracy in cross-domain transfer scenarios, representing a 10.7% improvement over the baseline. Ablation studies revealed that feature selection (+3.5%) and CORAL alignment (+3.2%) provided the largest individual contributions. While cross-domain performance is modest compared to in-domain results (94-96%), the pipeline offers critical advantages in transparency and modularity for deployment scenarios.

Approach

Pre-trained Wav2Vec 2.0 embeddings are passed through a modular pipeline that requires no labeled target data. The pipeline applies power transformation for feature normalization, ANOVA-based feature selection, joint PCA for dimensionality reduction, and CORAL alignment to match source and target covariance structures. A logistic regression classifier then makes the final prediction.
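The stages described above can be sketched with NumPy and scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: random arrays stand in for Wav2Vec 2.0 embeddings, and the helper names (coral, adapt_and_predict), the values k=32, n_components=16, and eps=1e-3, the choice to fit the power transform and ANOVA selection on source data only, and the mean shift inside CORAL are all hypothetical choices made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PowerTransformer


def _sym_power(C, p):
    # Power of a symmetric positive-definite matrix via eigendecomposition.
    w, V = np.linalg.eigh(C)
    return (V * w**p) @ V.T


def coral(Xs, Xt, eps=1e-3):
    # Whiten source features with Cs^(-1/2), then re-colour them with Ct^(1/2).
    # eps regularises both covariances; the mean shift is an assumption of this sketch.
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(d)
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(d)
    A = _sym_power(Cs, -0.5) @ _sym_power(Ct, 0.5)
    return (Xs - Xs.mean(axis=0)) @ A + Xt.mean(axis=0)


def adapt_and_predict(Xs, ys, Xt, k=32, n_components=16):
    # 1) Power transform (Yeo-Johnson by default), fitted on source only (assumption).
    pt = PowerTransformer().fit(Xs)
    Xs, Xt = pt.transform(Xs), pt.transform(Xt)
    # 2) ANOVA F-test feature selection; it needs labels, so only source data is used.
    sel = SelectKBest(f_classif, k=k).fit(Xs, ys)
    Xs, Xt = sel.transform(Xs), sel.transform(Xt)
    # 3) Joint PCA, fitted on the stacked, unlabeled source and target features.
    pca = PCA(n_components=n_components).fit(np.vstack([Xs, Xt]))
    Zs, Zt = pca.transform(Xs), pca.transform(Xt)
    # 4) CORAL: align source covariance (and mean) to the target.
    Zs_aligned = coral(Zs, Zt)
    # 5) Logistic regression trained on aligned source, applied to target.
    clf = LogisticRegression(max_iter=1000).fit(Zs_aligned, ys)
    return clf.predict(Zt)


# Synthetic stand-in for embeddings: 8 informative dimensions out of 64.
rng = np.random.default_rng(0)
n, d = 400, 64
ys = rng.integers(0, 2, size=n)
yt = rng.integers(0, 2, size=n)  # used for evaluation only, never for fitting
Xs = rng.normal(size=(n, d))
Xt = rng.normal(size=(n, d))
Xs[:, :8] += 1.5 * ys[:, None]
Xt[:, :8] += 1.5 * yt[:, None]
Xt = 2.0 * Xt + 0.5  # crude scale-and-offset domain shift

preds = adapt_and_predict(Xs, ys, Xt)
print("target accuracy:", (preds == yt).mean())
```

Because each stage is a separate fitted object, an ablation in the spirit of the paper amounts to skipping one step and re-running. The accuracy on this synthetic data says nothing about the reported results on ASVspoof 2019 LA or FoR.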

Datasets

ASVspoof 2019 Logical Access (LA), Fake-or-Real (FoR)

Model(s)

Wav2Vec 2.0, Logistic Regression

Author countries

UNKNOWN
