Adversarial Speaker Distillation for Countermeasure Model on Automatic Speaker Verification

Authors: Yen-Lun Liao, Xuanjun Chen, Chung-Che Wang, Jyh-Shing Roger Jang

Published: 2022-03-31 13:52:43+00:00

AI Summary

This paper proposes an adversarial speaker distillation method for creating lightweight countermeasure (CM) models for automatic speaker verification (ASV) systems. The method combines generalized end-to-end (GE2E) pre-training, adversarial fine-tuning, and knowledge distillation to achieve a smaller model size while maintaining high performance in detecting spoofed audio.

Abstract

The countermeasure (CM) model is developed to protect Automatic Speaker Verification (ASV) systems from spoof attacks and to prevent the resulting leakage of personal information. For reasons of practicality and security, the CM model is usually deployed on edge devices, which have more limited computing resources and storage space than cloud-based systems, placing a hard limit on model size. To better trade off CM model size against performance, we propose an adversarial speaker distillation method, an improved knowledge distillation method combined with generalized end-to-end (GE2E) pre-training and adversarial fine-tuning. In the evaluation phase of the ASVspoof 2021 Logical Access task, our proposed adversarial speaker distillation ResNetSE (ASD-ResNetSE) model reaches 0.2695 min t-DCF and 3.54% EER, while using only 22.5% of the parameters and 19.4% of the multiply-and-accumulate operations of the ResNetSE model.


Key findings
The proposed ASD-ResNetSE model achieved a min t-DCF of 0.2695 and an EER of 3.54% on the ASVspoof 2021 LA evaluation set. It did so with only 22.5% of the parameters and 19.4% of the multiply-accumulate operations of the original ResNetSE model, a substantial efficiency gain while remaining competitive in accuracy with other state-of-the-art models.
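For reference, EER is the operating point at which the false-acceptance and false-rejection rates coincide; min t-DCF additionally folds in ASV error costs and is not reproduced here. The following is a minimal, illustrative sketch of how EER is typically computed from per-utterance scores; it is not the authors' evaluation code.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate: the threshold where the false-acceptance rate
    (spoof accepted as bona fide) equals the false-rejection rate
    (bona fide rejected). Higher scores are assumed to mean "bona fide"."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))  # threshold where the two rates cross
    return (far[idx] + frr[idx]) / 2.0

# Toy example with synthetic scores (not real system output).
rng = np.random.default_rng(0)
eer = compute_eer(rng.normal(1.0, 1.0, 1000), rng.normal(-1.0, 1.0, 1000))
print(f"EER = {eer:.2%}")
```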
Approach
The authors address the problem of building lightweight CM models for ASV anti-spoofing with an adversarial speaker distillation method: a teacher model is pre-trained with the GE2E loss, fine-tuned on adversarial examples, and its knowledge is then distilled into a smaller student model via a knowledge distillation loss (sketched below). This balances model size against performance.
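As a point of reference for the distillation step, the sketch below shows a standard (Hinton-style) knowledge distillation loss in PyTorch. The paper's full objective also involves GE2E pre-training and adversarial fine-tuning of the teacher; the temperature and weighting here are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Hinton-style knowledge distillation: a weighted sum of
    (1) KL divergence between temperature-softened teacher and student
        output distributions, and
    (2) ordinary cross-entropy on the hard bona fide/spoof labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```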
Datasets
The ASVspoof 2019 dataset (for validation) and the ASVspoof 2021 Logical Access (LA) dataset (for evaluation). The MUSAN noise corpus and RIR (room impulse response) datasets were used for data augmentation.
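The summary does not describe the augmentation recipe in detail; below is a minimal sketch of the kind of additive-noise augmentation typically done with MUSAN clips. The SNR range and mixing policy are assumptions for illustration, not the paper's settings.

```python
import random
import torch

def add_noise_at_snr(speech: torch.Tensor, noise: torch.Tensor,
                     snr_db: float) -> torch.Tensor:
    """Mix a noise clip (e.g. from MUSAN) into a speech waveform at a
    target signal-to-noise ratio. Reverberation from an RIR dataset would
    instead be applied by convolving speech with an impulse response."""
    # Tile or trim the noise so it matches the speech length.
    if noise.numel() < speech.numel():
        noise = noise.repeat(speech.numel() // noise.numel() + 1)
    noise = noise[: speech.numel()]

    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    # Scale the noise so the mixture hits the requested SNR (in dB).
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Toy usage: mix synthetic "speech" and "noise" at a random SNR.
speech, noise = torch.randn(16000), torch.randn(8000)
augmented = add_noise_at_snr(speech, noise, snr_db=random.uniform(5.0, 20.0))
```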
Model(s)
ResNetSE serves as both the teacher and the student model; the teacher is additionally trained with GE2E pre-training and adversarial fine-tuning before its knowledge is distilled into the student.
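The specific adversarial fine-tuning procedure is not spelled out in this summary. As an illustration only, the sketch below assumes an FGSM-style scheme in which the teacher is updated on inputs perturbed along the sign of the input gradient; the paper's actual recipe may differ.

```python
import torch

def adversarial_fine_tune_step(model, loss_fn, optimizer, x, y, epsilon=0.002):
    """One FGSM-style adversarial fine-tuning step: craft a perturbed input
    that increases the loss, then update the model on it. epsilon and the
    FGSM attack itself are assumptions for illustration."""
    # 1) Gradient of the loss with respect to the input.
    x_req = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_req), y).backward()

    # 2) FGSM perturbation: a small step along the sign of the input gradient.
    x_adv = (x_req + epsilon * x_req.grad.sign()).detach()

    # 3) Update the model on the adversarial example.
    optimizer.zero_grad()
    adv_loss = loss_fn(model(x_adv), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```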
Author countries
Taiwan