ASASVIcomtech: The Vicomtech-UGR Speech Deepfake Detection and SASV Systems for the ASVspoof5 Challenge

Authors: Juan M. Martín-Doñas, Eros Roselló, Angel M. Gomez, Aitor Álvarez, Iván López-Espejo, Antonio M. Peinado

Published: 2024-08-19 18:57:34+00:00

Comment: This paper was accepted at ASVspoof Workshop 2024

AI Summary

This paper details the ASASVIcomtech team's participation in the ASVspoof5 Challenge, addressing both speech deepfake detection (Track 1) and spoofing-aware speaker verification (Track 2). A closed-condition system for Track 1 yielded unsatisfactory results (28.41% EER), but the team achieved competitive open-condition performance on both tracks (5.02% EER on Track 1 and a min a-DCF of 0.1295 on Track 2) with an ensemble system leveraging self-supervised models and augmented training data.

Abstract

This paper presents the work carried out by the ASASVIcomtech team, made up of researchers from Vicomtech and the University of Granada, for the ASVspoof5 Challenge. The team participated in both Track 1 (speech deepfake detection) and Track 2 (spoofing-aware speaker verification). The work started with an analysis of the data made available for the challenge, which was regarded as an essential step to avoid potential biases in the trained models, and whose main conclusions are presented here. Regarding the proposed approaches, a closed-condition system employing a deep complex convolutional recurrent architecture was developed for Track 1, although, unfortunately, no noteworthy results were achieved. On the other hand, several open-condition systems, based on leveraging self-supervised models, augmented training data from previous challenges, and novel vocoders, were explored for both tracks, finally achieving very competitive results with an ensemble system.


Key findings
The closed-condition system for Track 1 performed poorly (28.41% EER). The open-condition ensemble system, in contrast, achieved a competitive 5.02% EER on Track 1 and was robust across most spoofing attacks and codecs, although performance degraded for attack A28 and for the narrowband codec conditions C07 and C10. For Track 2, the combined ASV-CM system achieved a competitive min a-DCF of 0.1295.
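For readers unfamiliar with the EER figures quoted above, the following is a minimal sketch of how an equal error rate can be computed from detector scores. It is not the challenge's official scoring code: the function name `compute_eer`, the convention that higher scores mean "more likely bona fide", and the simple threshold sweep are all assumptions made for illustration.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for a detector where higher score = more bona fide.

    Illustrative helper only (hypothetical name, not the official ASVspoof5 scorer).
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every distinct observed score, in ascending order.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # Miss rate: fraction of bona fide trials rejected (score below threshold).
    miss = np.array([(bonafide < t).mean() for t in thresholds])
    # False alarm rate: fraction of spoofed trials accepted (score at or above threshold).
    false_alarm = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest to each other.
    idx = int(np.argmin(np.abs(miss - false_alarm)))
    eer = (miss[idx] + false_alarm[idx]) / 2.0
    return float(eer), float(thresholds[idx])


if __name__ == "__main__":
    # Toy scores, invented for this example.
    eer, thr = compute_eer([0.9, 0.8, 0.4, 0.7], [0.1, 0.5, 0.3, 0.2])
    print(f"EER = {100 * eer:.2f}% at threshold {thr:.2f}")
```

A threshold sweep over observed scores is coarse for small score sets; official tools typically interpolate along the ROC curve, so values can differ slightly.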
Approach
For Track 1 (deepfake detection) in open conditions, the team used an ensemble system based on Wav2Vec2-Large and WavLM-Base self-supervised models as feature extractors, with downstream classifiers fine-tuned on ASVspoof5 data augmented with ASVspoof 2019 and vocoded data. For Track 2 (spoofing-aware speaker verification), their approach combined this ensemble deepfake detection system with a pre-trained TitaNet-Large ASV model via non-linear score fusion of calibrated log-likelihood ratios.
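The summary does not spell out the fusion rule, so the sketch below shows only one possible non-linear combination of calibrated log-likelihood ratios (LLRs): a log-sum-exp "soft minimum" in which a trial is accepted only when both the ASV and the countermeasure (CM) scores are high. Whether this matches the authors' exact formula is an assumption; the function name `fuse_llrs` and the weight `rho` (a hypothetical prior share of spoofed trials among non-target trials) are illustrative.

```python
import numpy as np


def fuse_llrs(llr_asv, llr_cm, rho=0.5):
    """Non-linear fusion of calibrated ASV and CM log-likelihood ratios.

    Computes -log((1 - rho) * exp(-llr_asv) + rho * exp(-llr_cm)).
    `rho` is a hypothetical weight in (0, 1); it is not a value from the paper.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")

    llr_asv = np.asarray(llr_asv, dtype=float)
    llr_cm = np.asarray(llr_cm, dtype=float)

    # logaddexp evaluates log(exp(a) + exp(b)) without overflow for large |LLR|.
    return -np.logaddexp(np.log1p(-rho) - llr_asv, np.log(rho) - llr_cm)


if __name__ == "__main__":
    # Toy LLRs, invented for this example: (target, bona fide), (target, spoof),
    # (non-target, bona fide).
    asv = np.array([5.0, 5.0, -5.0])
    cm = np.array([5.0, -5.0, 5.0])
    print(fuse_llrs(asv, cm))
```

With this rule, a low score from either subsystem pulls the fused score down, which is the behaviour a spoofing-aware verifier needs: a spoofed utterance that sounds like the target speaker is still rejected.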
Datasets
ASVspoof5, ASVspoof 2019 (training and development data), Voc.v4 (additional vocoded data), LibriSpeech, VoxCeleb 1 and 2, NIST SRE 04–08, Fisher, Switchboard.
Model(s)
DCCRN (Deep Complex Convolutional Recurrent Network) for closed-condition Track 1; Wav2Vec2-Large, WavLM-Base (self-supervised models) with custom downstream classifiers for open-condition Track 1; TitaNet-Large (ASV model) for Track 2.
Author countries
Spain