SA-SASV: An End-to-End Spoof-Aggregated Spoofing-Aware Speaker Verification System

Authors: Zhongwei Teng, Quchen Fu, Jules White, Maria E. Powell, Douglas C. Schmidt

Published: 2022-03-12 21:15:59+00:00

AI Summary

This paper presents SA-SASV, an end-to-end spoofing-aware speaker verification system that uses multi-task classifiers optimized by multiple losses. Unlike previous approaches, SA-SASV avoids ensemble methods and places more flexible requirements on the training set. On the ASVSpoof 2019 LA dataset it reports a SASV-EER of 4.86%, outperforming the baseline SASV systems.

Abstract

Research in the past several years has boosted the performance of automatic speaker verification systems and countermeasure systems, delivering low Equal Error Rates (EERs) on each system. However, research on joint optimization of both systems is still limited. The Spoofing-Aware Speaker Verification (SASV) 2022 challenge was proposed to encourage the development of integrated SASV systems, with new metrics to evaluate joint model performance. This paper proposes an ensemble-free end-to-end solution, known as Spoof-Aggregated-SASV (SA-SASV), that builds a SASV system from multi-task classifiers optimized by multiple losses and that has more flexible training set requirements. The proposed system is trained on the ASVSpoof 2019 LA dataset, a spoof verification dataset with a small number of bonafide speakers. SASV-EER results indicate that model performance can be further improved by training on complete automatic speaker verification and countermeasure datasets.


Key findings
The proposed SA-SASV system significantly outperforms baseline SASV systems and previous state-of-the-art approaches on the ASVSpoof 2019 LA dataset, achieving a SASV-EER of 4.86%. Ablation studies demonstrate the importance of both the spoof aggregator and the spoof-source-based triplet loss for performance. However, the model shows limitations in generalizing to unseen speakers due to overfitting, suggesting the need for larger datasets.
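The SASV-EER metric cited above can be illustrated with a minimal sketch. It assumes the usual SASV 2022 convention that target trials are the positive class, while both bona fide non-target trials and spoofed trials count as negatives. The function names and the simple threshold sweep are illustrative choices, not the challenge's official scoring code.

```python
import numpy as np

def equal_error_rate(positive_scores, negative_scores):
    """Approximate the EER by sweeping thresholds over all observed scores."""
    pos = np.asarray(positive_scores, dtype=float)
    neg = np.asarray(negative_scores, dtype=float)
    thresholds = np.sort(np.concatenate([pos, neg]))
    # False rejection rate: positives scoring below the threshold.
    frr = np.array([(pos < t).mean() for t in thresholds])
    # False acceptance rate: negatives scoring at or above the threshold.
    far = np.array([(neg >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)

def sasv_eer(target_scores, nontarget_scores, spoof_scores):
    """Assumed SASV-EER: target trials vs. (non-target + spoof) trials."""
    negatives = np.concatenate([
        np.asarray(nontarget_scores, dtype=float),
        np.asarray(spoof_scores, dtype=float),
    ])
    return equal_error_rate(target_scores, negatives)
```

Under this convention a system can only reach a low SASV-EER if it rejects both impostor speakers and spoofed utterances, which is why the metric is used to judge the joint model rather than either subsystem alone.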
Approach
SA-SASV uses a multi-task learning approach with multiple loss functions to jointly optimize speaker verification and spoof detection. It combines a pre-trained automatic speaker verification (ASV) system with a lightweight raw waveform encoder and employs spoof-source-based triplet loss to enhance feature space aggregation and separation. The final decision is based on cosine similarity and countermeasure (CM) scores from the same model.
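The two ingredients named above, a triplet loss and a decision built from cosine similarity plus a CM score, can be sketched as follows. This is a generic illustration rather than the paper's implementation: the triplet loss is the standard margin formulation, the triplet selection (anchor and positive sharing a label such as the same spoofing source, negative drawn from a different one) is an assumption, and `fused_score` is a hypothetical fusion rule, since the summary does not state how the two scores are combined.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between an enrollment and a test embedding."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss with Euclidean distances.

    Assumption: anchor and positive share a label (for example the same
    spoofing source), while negative carries a different label.
    """
    anchor = np.asarray(anchor, dtype=float)
    d_pos = np.linalg.norm(anchor - np.asarray(positive, dtype=float))
    d_neg = np.linalg.norm(anchor - np.asarray(negative, dtype=float))
    # Zero loss once the negative is at least `margin` farther than the positive.
    return float(max(0.0, d_pos - d_neg + margin))

def fused_score(enroll_emb, test_emb, cm_bonafide_prob):
    """Hypothetical fusion: scale speaker similarity by the bona fide probability."""
    return cosine_similarity(enroll_emb, test_emb) * float(cm_bonafide_prob)
```

With such a rule, a spoofed utterance that closely matches the enrolled speaker still receives a low final score when the countermeasure output assigns it a low bona fide probability.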
Datasets
ASVSpoof 2019 LA dataset, VoxCeleb2 dataset (for pre-training)
Model(s)
ECAPA-TDNN (pre-trained), ARawNet (lightweight raw waveform encoder), multi-task classifiers with AAM-softmax and binary cross-entropy losses, spoof aggregator with adversarial learning, and spoof-source-based triplet loss.
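For readers unfamiliar with the two classifier losses listed above, the following NumPy sketch shows a generic AAM-softmax (additive angular margin) loss for speaker classification and a binary cross-entropy loss for bona fide versus spoof detection. The margin and scale values, the function names, and the omission of the usual handling for angles that exceed pi after the margin is added are all simplifying assumptions; the summary does not give the paper's hyperparameters.

```python
import numpy as np

def aam_softmax_loss(embeddings, class_weights, labels, margin=0.2, scale=30.0):
    """Generic AAM-softmax: add an angular margin to the target-class angle."""
    emb = np.asarray(embeddings, dtype=float)
    w = np.asarray(class_weights, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # L2-normalize so that dot products are cosines of the angles.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    w = w / np.linalg.norm(w, axis=1, keepdims=True)
    cos = np.clip(emb @ w.T, -1.0, 1.0)  # shape: (batch, classes)
    theta = np.arccos(cos)
    rows = np.arange(len(labels))
    logits = cos.copy()
    # Only the true-class logit is penalized by the margin.
    logits[rows, labels] = np.cos(theta[rows, labels] + margin)
    logits *= scale
    # Numerically stable softmax cross-entropy.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[rows, labels].mean())

def binary_cross_entropy(prob_bonafide, is_bonafide, eps=1e-12):
    """Binary cross-entropy for a bona fide (1) vs. spoof (0) classifier."""
    p = np.clip(np.asarray(prob_bonafide, dtype=float), eps, 1.0 - eps)
    y = np.asarray(is_bonafide, dtype=float)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())
```

In a multi-task setup the two losses would typically be summed, possibly with weights, alongside the other objectives; the weighting used by SA-SASV is not stated here.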
Author countries
USA