Leveraging Mixture of Experts for Improved Speech Deepfake Detection

Authors: Viola Negroni, Davide Salvi, Alessandro Ilic Mezza, Paolo Bestagini, Stefano Tubaro

Published: 2024-09-24 13:24:03+00:00

AI Summary

This paper proposes a novel speech deepfake detection method using a Mixture of Experts (MoE) architecture. The MoE framework enhances generalization and adaptability to unseen data by specializing experts on different datasets, outperforming traditional single models and ensemble methods. An efficient gating mechanism dynamically assigns expert weights, optimizing detection performance.

Abstract

Speech deepfakes pose a significant threat to personal security and content authenticity. Several detectors have been proposed in the literature, and one of the primary challenges these systems face is generalizing to unseen data in order to identify fake signals across a wide range of datasets. In this paper, we introduce a novel approach for enhancing speech deepfake detection performance using a Mixture of Experts architecture. The Mixture of Experts framework is well-suited to the speech deepfake detection task due to its ability to specialize in different input types and handle data variability efficiently. This approach offers superior generalization and adaptability to unseen data compared to traditional single models or ensemble methods. Additionally, its modular structure supports scalable updates, making it more flexible in managing the evolving complexity of deepfake techniques while maintaining high detection accuracy. We propose an efficient, lightweight gating mechanism that dynamically assigns expert weights for each input, optimizing detection performance. Experimental results across multiple datasets demonstrate the effectiveness and potential of the proposed approach.


Key findings
The enhanced MoE significantly outperforms single LCNN models, an ensemble of LCNNs, and a jointly trained LCNN on both seen and unseen datasets, achieving lower Equal Error Rates (EERs). Analysis of the gating network shows that it selects the most relevant experts according to the characteristics of the input data, confirming the effectiveness of the proposed routing.
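
The results above are reported in terms of EER, the threshold-free operating point at which the false acceptance and false rejection rates coincide. The following is a minimal sketch of how EER can be computed from detection scores with scikit-learn; it assumes bona fide utterances are labelled 1 and higher scores indicate bona fide speech, and it is not the authors' evaluation code.

import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    # ROC curve over all score thresholds; pos_label=1 marks bona fide speech.
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # threshold where FPR is closest to FNR
    return (fpr[idx] + fnr[idx]) / 2        # report their average as the EER

# Example: compute_eer(np.array([0, 0, 1, 1]), np.array([0.1, 0.4, 0.35, 0.8]))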
Approach
The authors employ a Mixture of Experts (MoE) architecture in which each expert is pre-trained on a different speech deepfake dataset. Two MoE variants are proposed: a standard MoE and an enhanced MoE that feeds the experts' internal representations to the gating network. In both cases, the gating network dynamically weights the experts' outputs to produce the final classification, as sketched below.
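
As a rough illustration of this design, the sketch below fuses pre-trained experts through a softmax gate that weights their output logits. It follows the standard variant, in which the gate reads the input features rather than the experts' internal representations as in the enhanced variant; all class names and dimensions are illustrative, not the authors' implementation.

import torch
import torch.nn as nn

class GatedMoE(nn.Module):
    """Weighted fusion of pre-trained per-dataset experts (illustrative sketch)."""

    def __init__(self, experts, feat_dim, hidden_dim=64):
        super().__init__()
        self.experts = nn.ModuleList(experts)   # each expert maps features -> one logit
        self.gate = nn.Sequential(              # lightweight gating network
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, len(experts)),
        )

    def forward(self, x):
        # x: (batch, feat_dim) pooled front-end features of an utterance
        weights = torch.softmax(self.gate(x), dim=-1)            # (batch, n_experts)
        logits = torch.cat([e(x) for e in self.experts], dim=1)  # (batch, n_experts)
        return (weights * logits).sum(dim=-1)                    # final real/fake score

# Usage: three toy "experts", each standing in for an LCNN trained on one dataset.
experts = [nn.Linear(128, 1) for _ in range(3)]
moe = GatedMoE(experts, feat_dim=128)
scores = moe(torch.randn(4, 128))   # -> shape (4,)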
Datasets
ASVspoof 2019 (D_ASV), FakeOrReal (D_FoR), ADD 2022 (D_ADD), In-the-Wild (D_ItW), Purdue speech dataset (D_PUR), TIMIT-TTS (D_TIM)
Model(s)
LCNN (Lightweight Convolutional Neural Network) as both baseline and expert within the MoE framework.
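
For context, the defining building block of LCNN-style detectors is the Max-Feature-Map (MFM) activation, which halves the channel count by taking an element-wise maximum between the two halves of the feature maps. The snippet below is a minimal, illustrative sketch of that operation, not the exact architecture used in the paper.

import torch
import torch.nn as nn

class MaxFeatureMap(nn.Module):
    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)   # split the channels into two halves
        return torch.max(a, b)            # element-wise max -> half the channels

# Example conv block in an LCNN-like front-end operating on a spectrogram.
block = nn.Sequential(nn.Conv2d(1, 32, kernel_size=3, padding=1), MaxFeatureMap())
out = block(torch.randn(4, 1, 60, 100))   # -> shape (4, 16, 60, 100)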
Author countries
Italy