SZU-AFS Antispoofing System for the ASVspoof 5 Challenge

Authors: Yuxiong Xu, Jiafeng Zhong, Sengui Zheng, Zefeng Liu, Bin Li

Published: 2024-08-19 12:12:29+00:00

Comment: 8 pages, 2 figures, ASVspoof 5 Workshop (Interspeech 2024 Satellite)

AI Summary

This paper introduces the SZU-AFS anti-spoofing system for the ASVspoof 5 Challenge Track 1, which employs a four-stage approach: baseline model selection, data augmentation (DA) for fine-tuning, gradient norm aware minimization (GAM) for secondary fine-tuning, and score-level fusion. The system leverages a Wav2Vec2 feature extractor and an AASIST classifier, enhanced by various DA policies and GAM-based co-enhancement. The final fused system achieved a minDCF of 0.115 and an EER of 4.04% on the evaluation set.

Abstract

This paper presents the SZU-AFS anti-spoofing system, designed for Track 1 of the ASVspoof 5 Challenge under open conditions. The system is built with four stages: selecting a baseline model, exploring effective data augmentation (DA) methods for fine-tuning, applying a co-enhancement strategy based on gradient norm aware minimization (GAM) for secondary fine-tuning, and fusing the logit scores from the two best-performing fine-tuned models. The system utilizes the Wav2Vec2 front-end feature extractor and the AASIST back-end classifier as the baseline model. During model fine-tuning, three distinct DA policies were investigated: single-DA, random-DA, and cascade-DA. Moreover, the employed GAM-based co-enhancement strategy, designed to fine-tune the augmented model at both data and optimizer levels, helps the Adam optimizer find flatter minima, thereby boosting model generalization. Overall, the final fusion system achieves a minDCF of 0.115 and an EER of 4.04% on the evaluation set.
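
The abstract names the three DA policies without defining them, so the following is only a minimal sketch under an assumed reading: single-DA applies one fixed augmentation, random-DA draws one augmentation at random per utterance, and cascade-DA applies several augmentations in sequence. The functions `time_mask`, `add_noise`, and `reverb`, along with all of their parameters, are hypothetical stand-ins and not the augmentations or settings used in the paper.

```python
import numpy as np


def time_mask(wave, rng, max_frac=0.1):
    """Zero out one random contiguous span (at most max_frac of the signal)."""
    out = wave.copy()
    width = int(rng.integers(0, int(len(out) * max_frac) + 1))
    start = int(rng.integers(0, len(out) - width + 1))
    out[start:start + width] = 0.0
    return out


def add_noise(wave, rng, snr_db=20.0):
    """Add white Gaussian noise at a fixed signal-to-noise ratio (assumed value)."""
    power = np.mean(wave ** 2) + 1e-12
    noise_power = power / (10 ** (snr_db / 10))
    return wave + rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)


def reverb(wave, rng, rir_len=800):
    """Convolve with a synthetic, exponentially decaying impulse response.

    A real setup would use measured room impulse responses instead.
    """
    rir = rng.normal(size=rir_len) * np.exp(-np.arange(rir_len) / (rir_len / 6))
    rir[0] = 1.0
    out = np.convolve(wave, rir)[: len(wave)]
    return out / (np.max(np.abs(out)) + 1e-12)  # peak-normalize


def single_da(wave, rng, aug):
    """Single-DA (assumed meaning): always apply the same augmentation."""
    return aug(wave, rng)


def random_da(wave, rng, augs):
    """Random-DA (assumed meaning): pick one augmentation at random."""
    aug = augs[int(rng.integers(len(augs)))]
    return aug(wave, rng)


def cascade_da(wave, rng, augs):
    """Cascade-DA (assumed meaning): apply the augmentations in order."""
    for aug in augs:
        wave = aug(wave, rng)
    return wave


rng = np.random.default_rng(0)
wave = rng.normal(size=16000)  # one second of placeholder audio at 16 kHz
augmented = cascade_da(wave, rng, [reverb, time_mask])
```

Under this reading, a combined method such as reverberation followed by time masking is simply a two-step cascade; the policies differ only in how augmentations are chosen and composed, not in the augmentations themselves.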


Key findings
The final fused system achieved a minDCF of 0.115 and an EER of 4.04% on the ASVspoof 5 Challenge evaluation set. Among the data augmentation options explored, the RIR-TimeMask method and the cascade-DA policy were found to be effective. Additionally, GAM improved model generalization when combined with the Adam optimizer, at the cost of longer training time.
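
For readers unfamiliar with the EER figure quoted above, the sketch below shows one simple way to estimate an equal error rate from two sets of scores. It assumes that higher scores indicate bona fide speech and uses a plain threshold sweep; the official challenge tooling may interpolate between operating points and can give slightly different values. The function name `compute_eer` is illustrative, and minDCF is not covered because its cost parameters are not stated in this summary.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the equal error rate with a threshold sweep.

    Assumes higher scores mean "bona fide". A trial is accepted when
    its score is greater than or equal to the threshold.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every observed score, plus one above them all.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER is read off where the two error rates are closest.
    idx = int(np.argmin(np.abs(frr - far)))
    return float((frr[idx] + far[idx]) / 2.0)
```

An EER of 4.04% therefore means that, at the threshold where both error types are equally likely, roughly 4 in 100 bona fide utterances are rejected and roughly 4 in 100 spoofed utterances are accepted.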
Approach
The system tackles deepfake speech detection through a four-stage process: selecting Wav2Vec2 with AASIST as the baseline model; exploring single-DA, random-DA, and cascade-DA policies for primary fine-tuning; applying a GAM-based co-enhancement strategy for secondary fine-tuning to improve generalization; and finally fusing the logit scores from the two best-performing fine-tuned models.
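
The last stage, score-level fusion, can be illustrated with a short sketch. This summary does not state how the two systems' scores are combined, so the weighted average below (equal weights by default) is an assumption, and `fuse_scores` and `weight_a` are hypothetical names.

```python
import numpy as np


def fuse_scores(scores_a, scores_b, weight_a=0.5):
    """Fuse per-trial scores from two systems by weighted averaging.

    The equal default weighting is an assumption, not the paper's setting.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both systems must score the same trials")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return weight_a * a + (1.0 - weight_a) * b


# Two trials scored by two fine-tuned models (made-up numbers).
fused = fuse_scores([2.0, -1.0], [4.0, -3.0])
```

Averaging of this kind only makes sense when both systems produce scores on comparable scales, which is plausible here because both models share the same architecture and output logits.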
Datasets
ASVspoof 5 Challenge Track 1 database (training, development, progress, and evaluation sets); English partition of Multilingual LibriSpeech (MLS).
Model(s)
Wav2Vec2 (front-end feature extractor), AASIST (back-end classifier). Other models explored include WavLM, HuBERT (feature extractors) and Fully Connected, Conformer (classifiers).
Author countries
China