MoLEx: Mixture of LoRA Experts in Speech Self-Supervised Models for Audio Deepfake Detection

Authors: Zihan Pan, Sailor Hardik Bhupendra, Jinyang Wu

Published: 2025-09-11 06:18:29+00:00

AI Summary

This paper introduces MoLEx, a parameter-efficient framework for audio deepfake detection that combines Low-Rank Adaptation (LoRA) with a Mixture-of-Experts (MoE) router. MoLEx efficiently finetunes pre-trained self-supervised learning (SSL) models by updating only selected experts, achieving state-of-the-art performance at reduced computational cost.

Abstract

While self-supervised learning (SSL)-based models have boosted audio deepfake detection accuracy, fully finetuning them is computationally expensive. To address this, we propose a parameter-efficient framework that combines Low-Rank Adaptation with a Mixture-of-Experts router, called Mixture of LoRA Experts (MoLEx). It preserves pre-trained knowledge of SSL models while efficiently finetuning only selected experts, reducing training costs while maintaining robust performance. The observed utility of experts during inference shows the router reactivates the same experts for similar attacks but switches to other experts for novel spoofs, confirming MoLEx's domain-aware adaptability. MoLEx additionally offers flexibility for domain adaptation by allowing extra experts to be trained without modifying the entire model. We mainly evaluate our approach on the ASVSpoof 5 dataset and achieve the state-of-the-art (SOTA) equal error rate (EER) of 5.56% on the evaluation set without augmentation.


Key findings
MoLEx achieves a state-of-the-art equal error rate (EER) of 5.56% on the ASVSpoof 5 evaluation set without augmentation. Expert-utilization analysis confirms MoLEx's domain-aware adaptability: the router reuses experts for similar attacks and switches to others for novel spoofs. MoLEx also adapts to new domains by adding new experts, without significant performance degradation on previously seen datasets.
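For reference, below is a minimal sketch of how an equal error rate such as the reported 5.56% is computed from detector scores; the function name and synthetic scores are illustrative, not taken from the paper.

```python
import numpy as np

def compute_eer(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """EER is the error rate at the decision threshold where the false-acceptance
    rate (spoof scored as bonafide) equals the false-rejection rate (bonafide
    scored as spoof). Higher scores are assumed to mean 'more bonafide'."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])    # false-acceptance rate
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])  # false-rejection rate
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2)

# Toy usage with synthetic scores; a real evaluation would use the model's
# scores on the ASVSpoof 5 evaluation set.
rng = np.random.default_rng(0)
print(f"EER: {compute_eer(rng.normal(1.0, 1.0, 1000), rng.normal(-1.0, 1.0, 1000)):.2%}")
```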
Approach
MoLEx modifies the transformer layers of a pre-trained WavLM model by incorporating LoRA adapters as experts. A gating network selects a subset of these experts for each input, reducing computational overhead. An orthogonality regularization loss is introduced to enhance the expressiveness of the LoRA experts.
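A minimal PyTorch sketch of such a layer is shown below, assuming a frozen base projection, a small bank of LoRA experts, a top-k gating network, and one plausible form of the orthogonality penalty. The expert count, rank, k, and the exact loss formulation are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """One expert: a low-rank adapter B(A(x)) added to a frozen projection."""
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        self.A = nn.Linear(in_dim, rank, bias=False)   # down-projection
        self.B = nn.Linear(rank, out_dim, bias=False)  # up-projection, zero-initialised
        nn.init.zeros_(self.B.weight)

    def forward(self, x):
        return self.B(self.A(x))

class MoLExLayer(nn.Module):
    """Frozen pre-trained linear layer plus a router over LoRA experts."""
    def __init__(self, base_linear: nn.Linear, num_experts: int = 4,
                 top_k: int = 2, rank: int = 8):
        super().__init__()
        self.base = base_linear                        # frozen pre-trained weights
        for p in self.base.parameters():
            p.requires_grad = False
        self.experts = nn.ModuleList([
            LoRAExpert(base_linear.in_features, base_linear.out_features, rank)
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(base_linear.in_features, num_experts)  # gating network
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, time, in_dim)
        gate_logits = self.router(x)                   # (batch, time, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = self.base(x)                             # frozen pre-trained path
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., slot] == e).unsqueeze(-1)  # tokens routed to expert e
                if mask.any():                              # experts with no routed tokens are skipped
                    out = out + mask * weights[..., slot:slot + 1] * expert(x)
        return out

    def orthogonality_loss(self):
        """One plausible regulariser (an assumption): push the experts'
        down-projections toward mutually orthogonal directions so they stay diverse."""
        A = torch.stack([F.normalize(e.A.weight.flatten(), dim=0) for e in self.experts])
        gram = A @ A.t()                               # pairwise cosine similarities
        off_diag = gram - torch.eye(gram.size(0), device=gram.device)
        return off_diag.pow(2).sum()
```

Under this layout, the domain-adaptation flexibility noted in the abstract would amount to appending another LoRAExpert and widening the router's output while keeping the frozen base weights and existing experts untouched; the actual insertion points within the WavLM transformer blocks may differ from this sketch.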
Datasets
ASVSpoof 2019, ASVSpoof 2021 LA/DF, ASVSpoof 5, DFADD, FakeOrReal, In the Wild, LibriSeVoc
Model(s)
WavLM (large), LSTM
Author countries
Singapore