Toward Improving Synthetic Audio Spoofing Detection Robustness via Meta-Learning and Disentangled Training With Adversarial Examples

Authors: Zhenyu Wang, John H. L. Hansen

Published: 2024-08-23 19:26:54+00:00

AI Summary

This paper proposes a robust synthetic audio spoofing detection system using a RawNet2-based encoder enhanced with a simple attention module, a weighted additive angular margin loss to address data imbalance, and a meta-learning framework for generalization to unseen attacks. The system also incorporates adversarial examples with an auxiliary batch normalization for disentangled training, achieving a pooled EER of 0.87% and a min t-DCF of 0.0277 on the ASVspoof 2019 LA corpus.

Abstract

Advances in automatic speaker verification (ASV) promote research into the formulation of spoofing detection systems for real-world applications. The performance of ASV systems can be degraded severely by multiple types of spoofing attacks, namely, synthetic speech (SS), voice conversion (VC), replay, twins and impersonation, especially in the case of unseen synthetic spoofing attacks. A reliable and robust spoofing detection system can act as a security gate to filter out spoofing attacks instead of having them reach the ASV system. In this study, a weighted additive angular margin loss is proposed to address the data imbalance issue, and different margins are assigned to improve generalization to unseen spoofing attacks. Meanwhile, we incorporate a meta-learning loss function to optimize differences between the embeddings of the support and query sets in order to learn a spoofing-category-independent embedding space for utterances. Furthermore, we craft adversarial examples by adding imperceptible perturbations to spoofing speech as a data augmentation strategy, and then use an auxiliary batch normalization (BN) to guarantee that the corresponding normalization statistics are computed exclusively on the adversarial examples. Additionally, a simple attention module is integrated into the residual block to refine the feature extraction process. Evaluation results on the Logical Access (LA) track of the ASVspoof 2019 corpus confirm the effectiveness of the proposed approaches, with a pooled EER of 0.87% and a min t-DCF of 0.0277. These advancements offer effective options to reduce the impact of spoofing attacks on voice recognition/authentication systems.
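The auxiliary BN idea in the abstract can be illustrated with a minimal NumPy sketch in which clean and adversarial batches are normalized with separate statistics. This is not the authors' implementation: the class name `DualBatchNorm`, the branch labels, the choice to give each branch its own scale and shift, and the random sign perturbation standing in for a real adversarial attack are all assumptions made for illustration.

```python
import numpy as np


class DualBatchNorm:
    """Batch normalization with one set of statistics per input branch (hypothetical helper)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        # Assumption: each branch owns its running statistics and its affine parameters.
        self.branches = {
            name: {
                "mean": np.zeros(num_features),
                "var": np.ones(num_features),
                "gamma": np.ones(num_features),
                "beta": np.zeros(num_features),
            }
            for name in ("clean", "adv")
        }

    def __call__(self, x, branch, training=True):
        state = self.branches[branch]
        if training:
            # Batch statistics update only the branch that received this batch.
            mean, var = x.mean(axis=0), x.var(axis=0)
            state["mean"] = (1 - self.momentum) * state["mean"] + self.momentum * mean
            state["var"] = (1 - self.momentum) * state["var"] + self.momentum * var
        else:
            mean, var = state["mean"], state["var"]
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return state["gamma"] * x_hat + state["beta"]


rng = np.random.default_rng(0)
clean = rng.normal(size=(8, 4))
# Stand-in perturbation; a real attack would use gradients of the detector's loss.
adversarial = clean + 0.01 * np.sign(rng.normal(size=clean.shape))

bn = DualBatchNorm(num_features=4)
out_clean = bn(clean, branch="clean")      # touches only the "clean" statistics
out_adv = bn(adversarial, branch="adv")    # touches only the "adv" statistics
```

Routing each batch through its own branch keeps the adversarial distribution from shifting the statistics used for clean speech, which is the separation the abstract describes.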

Key findings

The proposed system significantly outperforms the baseline and several state-of-the-art systems on the ASVspoof 2019 LA dataset, achieving a pooled EER of 0.87% and a min t-DCF of 0.0277. The simple attention module, weighted loss function, meta-learning, and adversarial training each contribute to the improved performance.
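For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below approximates it by sweeping every observed score as a threshold. The function name `compute_eer` is hypothetical, the sketch assumes that a higher score means "more likely bona fide", and it is not the official ASVspoof scoring tool.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the equal error rate by trying each observed score as a threshold.

    Assumes a higher score means "more likely bona fide".
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bona fide utterances scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed utterances scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With the illustrative scores `[0.9, 0.8, 0.7, 0.3]` for bona fide speech and `[0.1, 0.2, 0.4, 0.6]` for spoofed speech (not data from the paper), one of four utterances in each group lands on the wrong side of the best threshold, so the function returns 0.25.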

Approach

The authors improve RawNet2's robustness by integrating a simple attention module for feature refinement, a weighted additive angular margin loss for handling class imbalance and unseen attacks, and a meta-learning framework. Adversarial examples with an auxiliary batch normalization are used for disentangled training to enhance model generalization.
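One way to picture a weighted additive angular margin loss is the NumPy sketch below, which adds a per-class margin to the target angle and weights each sample by its class. The summary does not give the authors' exact formulation, so the function name `weighted_aam_loss`, the per-class weight and margin arrays, the scale of 30, and the normalization by the weight sum are all assumptions. The sketch also omits the safeguards that practical implementations apply when the angle plus the margin exceeds pi.

```python
import numpy as np


def weighted_aam_loss(embeddings, class_centers, labels,
                      class_weights, class_margins, scale=30.0):
    """Additive angular margin softmax with per-class margins and weights (illustrative)."""
    embeddings = np.asarray(embeddings, dtype=float)
    class_centers = np.asarray(class_centers, dtype=float)
    labels = np.asarray(labels, dtype=int)
    class_weights = np.asarray(class_weights, dtype=float)
    class_margins = np.asarray(class_margins, dtype=float)

    # Cosine similarity between L2-normalized embeddings and class centers.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_centers / np.linalg.norm(class_centers, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)

    # Add the class-specific margin to the angle of the target class only.
    rows = np.arange(len(labels))
    theta_target = np.arccos(cos[rows, labels])
    logits = scale * cos
    logits[rows, labels] = scale * np.cos(theta_target + class_margins[labels])

    # Numerically stable log-softmax, then weighted cross-entropy.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_sample = -log_probs[rows, labels]
    weights = class_weights[labels]
    return float((weights * per_sample).sum() / weights.sum())
```

In this sketch, a larger weight on the minority class counteracts the imbalance between bona fide and spoofed utterances, while a larger margin demands a tighter angular cluster for that class.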

Datasets

ASVspoof 2019 Logical Access (LA) track

Model(s)

RawNet2-based encoder with SimAM attention module, relation network for meta-learning
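SimAM is a parameter-free attention mechanism that gates each activation by how far it deviates from its channel mean. A minimal NumPy sketch of that gating follows. The function name `simam` and the default `e_lambda` are assumptions, and the summary does not say how the authors place the module inside RawNet2's residual blocks, so the sketch simply reduces over every axis after the channel axis.

```python
import numpy as np


def simam(x, e_lambda=1e-4):
    """Parameter-free attention over a feature map shaped (batch, channels, ...)."""
    x = np.asarray(x, dtype=float)
    axes = tuple(range(2, x.ndim))                 # time and/or spatial axes
    n = np.prod([x.shape[a] for a in axes]) - 1    # positions per channel, minus one

    # Squared deviation of each activation from its channel mean.
    d = (x - x.mean(axis=axes, keepdims=True)) ** 2
    # Channel variance estimate.
    v = d.sum(axis=axes, keepdims=True) / n
    # Activations that stand out from their channel receive a larger score.
    score = d / (4.0 * (v + e_lambda)) + 0.5
    # Sigmoid gate applied elementwise to the input.
    return x * (1.0 / (1.0 + np.exp(-score)))
```

Because the score never drops below 0.5, every activation keeps at least sigmoid(0.5), about 62%, of its magnitude, and the module adds no trainable weights to the encoder.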

Author countries

USA
