ASASVIcomtech: The Vicomtech-UGR Speech Deepfake Detection and SASV Systems for the ASVspoof5 Challenge

Authors: Juan M. Martín-Doñas, Eros Roselló, Angel M. Gomez, Aitor Álvarez, Iván López-Espejo, Antonio M. Peinado

Published: 2024-08-19 18:57:34+00:00

AI Summary

This paper details the ASASVIcomtech team's participation in the ASVspoof5 Challenge, focusing on speech deepfake detection (Track 1) and spoofing-aware speaker verification (Track 2). While a closed-condition system using a deep complex convolutional recurrent network (DCCRN) yielded unsatisfactory results, an open-condition ensemble system leveraging self-supervised models and augmented data achieved highly competitive performance.

Abstract

This paper presents the work carried out by the ASASVIcomtech team, made up of researchers from Vicomtech and the University of Granada, for the ASVspoof5 Challenge. The team has participated in both Track 1 (speech deepfake detection) and Track 2 (spoofing-aware speaker verification). This work started with an analysis of the available challenge data, which was regarded as an essential step to avoid later potential biases of the trained models, and whose main conclusions are presented here. With respect to the proposed approaches, a closed-condition system employing a deep complex convolutional recurrent architecture was developed for Track 1, although, unfortunately, no noteworthy results were achieved. On the other hand, different possibilities of open-condition systems, based on leveraging self-supervised models, augmented training data from previous challenges, and novel vocoders, were explored for both tracks, finally achieving very competitive results with an ensemble system.


Key findings
The open-condition ensemble system significantly outperformed the closed-condition system. The ensemble approach, incorporating self-supervised models and data augmentation, achieved competitive results in both Track 1 (speech deepfake detection) and Track 2 (spoofing-aware speaker verification) of the ASVspoof5 Challenge. Calibration of scores proved crucial for optimal performance.
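Since the summary highlights score calibration as crucial, a common approach in anti-spoofing pipelines is to learn an affine map from raw detector scores to calibrated log-likelihood ratios with logistic regression. The sketch below is illustrative only (toy data, assumed setup), not the authors' exact calibrator:

```python
# Hypothetical sketch: logistic-regression score calibration, a standard
# technique for mapping raw countermeasure scores to calibrated LLRs.
# The toy score distributions below are illustrative, not challenge data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
bona = rng.normal(2.0, 1.0, 500)    # bona fide trials score higher
spoof = rng.normal(-1.0, 1.0, 500)  # spoofed trials score lower
scores = np.concatenate([bona, spoof]).reshape(-1, 1)
labels = np.concatenate([np.ones(500), np.zeros(500)])

cal = LogisticRegression()
cal.fit(scores, labels)

# The learned calibration is an affine map a*s + b on the raw score s,
# interpretable as a log-likelihood ratio.
a, b = cal.coef_[0, 0], cal.intercept_[0]

def calibrated_llr(s):
    return a * s + b
```

A monotonic affine map leaves detection rankings (and hence EER) unchanged but makes the scores' operating points and thresholds meaningful across systems, which matters when fusing or ensembling.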
Approach
For Track 1 (speech deepfake detection), the authors employed an ensemble system using pre-trained self-supervised models (Wav2Vec2-Large and WavLM-Base) as feature extractors, followed by fine-tuned downstream classifiers. For Track 2 (spoofing-aware speaker verification), they combined this system with a pre-trained TitaNet-Large ASV model using non-linear score fusion.
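The two-stage scoring described above can be sketched as follows. The score averaging for the countermeasure (CM) ensemble and the sigmoid-product fusion with the ASV score are illustrative stand-ins; the paper does not specify these exact formulas, only that the fusion is non-linear:

```python
# Hypothetical sketch of the Track 1/Track 2 scoring pipeline:
# several CM subsystems (e.g., SSL front-ends with downstream classifiers)
# are ensembled, then the CM score is fused non-linearly with an ASV score.
import numpy as np

def ensemble_cm_score(subsystem_scores):
    """Average the scores of the individual CM subsystems (illustrative)."""
    return float(np.mean(subsystem_scores))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sasv_score(cm_score, asv_score):
    """Illustrative non-linear fusion: a trial is accepted only if it looks
    both bona fide (high CM score) and target speaker (high ASV score)."""
    return sigmoid(cm_score) * sigmoid(asv_score)

# Usage: three CM subsystems agree the trial is bona fide; ASV says target.
cm = ensemble_cm_score([2.1, 1.8, 2.5])
fused = sasv_score(cm, asv_score=3.0)
```

The product of sigmoids captures the AND-like logic of spoofing-aware verification: either a low CM score (spoof) or a low ASV score (non-target) is enough to reject the trial.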
Datasets
ASVspoof5 training and development datasets, ASVspoof 2019 training and development datasets, VCTK database, LibriSpeech corpus, additional vocoded data from the Voc.v4 partition.
Model(s)
Deep Complex Convolutional Recurrent Network (DCCRN), Wav2Vec2-Large, WavLM-Base, TitaNet-Large, various downstream classifiers (including NN-ASP), logistic regression calibrator, beta calibrator.
Author countries
Spain