SZU-AFS Antispoofing System for the ASVspoof 5 Challenge


Authors: Yuxiong Xu, Jiafeng Zhong, Sengui Zheng, Zefeng Liu, Bin Li

Published: 2024-08-19 12:12:29+00:00

AI Summary

This paper introduces SZU-AFS, an anti-spoofing system for the ASVspoof 5 Challenge, focusing on standalone speech deepfake detection. The system leverages a four-stage approach: baseline model selection, data augmentation exploration, a co-enhancement strategy using gradient norm aware minimization (GAM), and logit score fusion, achieving a minDCF of 0.115 and an EER of 4.04% on the evaluation set.

Abstract

This paper presents the SZU-AFS anti-spoofing system, designed for Track 1 of the ASVspoof 5 Challenge under open conditions. The system is built in four stages: selecting a baseline model, exploring effective data augmentation (DA) methods for fine-tuning, applying a co-enhancement strategy based on gradient norm aware minimization (GAM) for secondary fine-tuning, and fusing the logit scores from the two best-performing fine-tuned models. The system uses the Wav2Vec2 front-end feature extractor and the AASIST back-end classifier as the baseline model. During model fine-tuning, three distinct DA policies were investigated: single-DA, random-DA, and cascade-DA. Moreover, the GAM-based co-enhancement strategy, designed to fine-tune the augmented model at both the data and optimizer levels, helps the Adam optimizer find flatter minima, thereby boosting model generalization. Overall, the final fusion system achieves a minDCF of 0.115 and an EER of 4.04% on the evaluation set.
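The three DA policies named in the abstract are not defined on this page, so the following is only a minimal sketch of one plausible reading: single-DA applies one fixed augmentation, random-DA picks one augmentation at random per sample, and cascade-DA applies several in sequence. The functions `reverb` and `time_mask` are hypothetical toy stand-ins operating on plain lists of samples, not the augmentations used by the authors.

```python
import random

def reverb(wave, impulse=(1.0, 0.5, 0.25), rng=random):
    """Toy RIR: convolve with a short, hypothetical impulse response.

    The output is truncated to the input length. `rng` is unused and
    only kept so all augmentations share one call signature.
    """
    out = [0.0] * len(wave)
    for i, x in enumerate(wave):
        for j, h in enumerate(impulse):
            if i + j < len(out):
                out[i + j] += x * h
    return out

def time_mask(wave, max_width=4, rng=random):
    """Toy TimeMask: zero out one random contiguous span of samples."""
    if not wave:
        return []
    width = rng.randint(1, min(max_width, len(wave)))
    start = rng.randint(0, len(wave) - width)
    return [0.0 if start <= i < start + width else x
            for i, x in enumerate(wave)]

def single_da(wave, aug, rng=random):
    """Assumed single-DA: always apply the same augmentation."""
    return aug(wave, rng=rng)

def random_da(wave, augs, rng=random):
    """Assumed random-DA: apply one augmentation chosen at random."""
    return rng.choice(augs)(wave, rng=rng)

def cascade_da(wave, augs, rng=random):
    """Assumed cascade-DA: apply every augmentation in order."""
    for aug in augs:
        wave = aug(wave, rng=rng)
    return wave
```

Under this reading, a combined policy such as reverberation followed by masking would be `cascade_da(wave, [reverb, time_mask])`; passing a seeded `random.Random` instance as `rng` makes the masking reproducible.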

Key findings

The RIR-TimeMask data augmentation method proved highly effective. The GAM-based co-enhancement strategy significantly improved model generalization. The final fused system achieved a minDCF of 0.115 and an EER of 4.04% on the evaluation set.
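For readers unfamiliar with the EER metric quoted above, here is a minimal sketch of how an equal error rate can be approximated from detection scores. It assumes higher scores mean "more likely bona fide" and simply scans the observed scores as candidate thresholds; official challenge scoring tools may interpolate differently, and the function name `compute_eer` is illustrative only. minDCF is not covered here.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the equal error rate (EER).

    Returns (eer, threshold), where the threshold is the candidate at
    which the false acceptance and false rejection rates are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: bona fide utterances scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed utterances scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores, invented for illustration (not from the paper).
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
```

In this toy case one of four bona fide scores falls below the chosen threshold and one of four spoof scores reaches it, so both error rates meet at 25%.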

Approach

SZU-AFS uses a four-stage process. First, it selects a baseline model combining Wav2Vec2 and AASIST. Second, it explores data augmentation (DA) methods for fine-tuning. Third, it applies a GAM-based co-enhancement strategy for secondary fine-tuning. Finally, it fuses logits from the two best-performing models.

Datasets

ASVspoof 5 Track 1 dataset (training, development, progress, and evaluation sets), which includes data from the Multilingual LibriSpeech English partition.

Model(s)

Wav2Vec2 (feature extractor), AASIST (classifier), and a fully connected layer for score fusion.
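As a rough illustration of fusing two models' scores with a fully connected layer: when the layer takes one logit per model and emits one fused score, it reduces to a weighted sum plus a bias. The sketch below assumes exactly that shape; the weights and bias shown are hypothetical placeholders, whereas in a trained system they would be learned, and this page does not state their values.

```python
def fuse_logits(logit_a, logit_b, w_a=0.5, w_b=0.5, bias=0.0):
    """Fuse two per-utterance logits with a 2-input, 1-output linear layer.

    The default weights (a plain average) are placeholders, not values
    reported by the authors.
    """
    return w_a * logit_a + w_b * logit_b + bias


def fuse_batch(logits_a, logits_b, **params):
    """Apply the same fusion to paired lists of logits."""
    if len(logits_a) != len(logits_b):
        raise ValueError("both models must score the same utterances")
    return [fuse_logits(a, b, **params) for a, b in zip(logits_a, logits_b)]
```

With the default weights this is a simple average, so `fuse_logits(2.0, 4.0)` gives `3.0`; unequal weights let the stronger model dominate the fused score.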

Author countries

China
