Learn from Real: Reality Defender's Submission to ASVspoof5 Challenge

Authors: Yi Zhu, Chirag Goel, Surya Koppisetti, Trang Tran, Ankur Kumar, Gaurav Bharaj

Published: 2024-10-09 18:55:28+00:00

AI Summary

This paper presents Reality Defender's submission to the ASVspoof5 challenge, built around its SLIM system and a novel pretraining strategy. SLIM uses self-supervised contrastive learning to learn style-linguistics dependency embeddings from bonafide speech, improving generalizability while keeping training cost low.

Abstract

Audio deepfake detection is crucial to combat the malicious use of AI-synthesized speech. Among many efforts undertaken by the community, the ASVspoof challenge has become one of the benchmarks to evaluate the generalizability and robustness of detection models. In this paper, we present Reality Defender's submission to the ASVspoof5 challenge, highlighting a novel pretraining strategy which significantly improves generalizability while maintaining low computational cost during training. Our system SLIM learns the style-linguistics dependency embeddings from various types of bonafide speech using self-supervised contrastive learning. The learned embeddings help to discriminate spoof from bonafide speech by focusing on the relationship between the style and linguistics aspects. We evaluated our system on ASVspoof5, ASV2019, and In-the-wild. Our submission achieved minDCF of 0.1499 and EER of 5.5% on ASVspoof5 Track 1, and EER of 7.4% and 10.8% on ASV2019 and In-the-wild respectively.


Key findings
SLIM achieved a minDCF of 0.1499 and an EER of 5.5% on ASVspoof5 Track 1. The model generalized well to out-of-domain datasets (ASV2019 and In-the-wild), indicating the effectiveness of the pretraining strategy. Performance degraded on codec-processed data, suggesting that codec-specific data augmentation could yield further improvements.
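For context, the EER reported above is the operating point where the false-acceptance and false-rejection rates are equal. Below is a minimal sketch of computing EER from raw detection scores; the function name and the convention that bonafide scores are higher than spoof scores are assumptions for illustration, not details from the paper.

```python
import numpy as np

def compute_eer(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Equal error rate: the threshold where FAR and FRR cross."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # False rejection: bonafide scored below the threshold.
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # False acceptance: spoof scored at or above the threshold.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(frr - far))
    return float((frr[idx] + far[idx]) / 2)
```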
Approach
The SLIM system is trained in two stages. The first stage applies self-supervised contrastive learning to bonafide speech only, learning embeddings that capture the dependency between a speaker's style and the linguistic content. The second stage feeds these embeddings, together with raw SSL embeddings, into a classifier that discriminates bonafide from spoofed audio.
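To make the stage-1 objective concrete, here is a hedged PyTorch sketch of a contrastive loss that pulls the style and linguistics views of the same bonafide utterance together and pushes apart mismatched pairs across the batch. The class name, projector shapes, and the InfoNCE-style formulation are illustrative assumptions about the described strategy, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleLinguisticsPretrainSketch(nn.Module):
    """Stage-1 sketch: align 'style' and 'linguistics' projections of SSL
    features from bonafide speech (dimensions are assumptions)."""

    def __init__(self, ssl_dim: int = 768, proj_dim: int = 256):
        super().__init__()
        self.style_proj = nn.Sequential(
            nn.Linear(ssl_dim, proj_dim), nn.ReLU(), nn.Linear(proj_dim, proj_dim))
        self.ling_proj = nn.Sequential(
            nn.Linear(ssl_dim, proj_dim), nn.ReLU(), nn.Linear(proj_dim, proj_dim))

    def contrastive_loss(self, style_feat, ling_feat, temperature=0.07):
        # Matched (style, linguistics) pairs from the same utterance are
        # positives; all other pairs in the batch are negatives.
        s = F.normalize(self.style_proj(style_feat), dim=-1)
        l = F.normalize(self.ling_proj(ling_feat), dim=-1)
        logits = s @ l.T / temperature
        targets = torch.arange(s.size(0), device=s.device)
        return F.cross_entropy(logits, targets)
```

After pretraining, spoofed speech tends to break the learned style-linguistics dependency, which is the signal the stage-2 classifier exploits.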
Datasets
ASVspoof5 (train, dev, eval), ASV2019 Logical Access (LA), In-the-wild (ITW), CommonVoice, RAVDESS
Model(s)
WavLM-Base (as SSL backbone), a custom projector network, and a classifier built from attentive statistics pooling (ASP) and MLPs
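As an illustration of the classifier stage, below is a minimal sketch of attentive statistics pooling followed by MLPs over WavLM-Base frame features. The hidden sizes, binary logit output, and the specific ASP formulation are assumptions; the submission's exact layer configuration is not given here.

```python
import torch
import torch.nn as nn

class ASPClassifierSketch(nn.Module):
    """Attentive statistics pooling (ASP) over frame features, then an MLP.
    Assumes WavLM-Base frame features of dimension 768."""

    def __init__(self, feat_dim: int = 768, hidden: int = 128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))  # bonafide vs. spoof logits

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim)
        w = torch.softmax(self.attn(frames), dim=1)            # attention over time
        mu = (w * frames).sum(dim=1)                           # weighted mean
        var = (w * (frames - mu.unsqueeze(1)) ** 2).sum(dim=1)
        std = torch.sqrt(var.clamp_min(1e-8))                  # weighted std
        return self.mlp(torch.cat([mu, std], dim=-1))          # pooled stats -> logits
```

Pooling both the weighted mean and standard deviation lets the classifier use utterance-level variability, not just the average frame, which is a common choice in anti-spoofing and speaker-verification heads.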
Author countries
USA, Canada