One-Class Knowledge Distillation for Spoofing Speech Detection

Authors: Jingze Lu, Yuxiang Zhang, Wenchao Wang, Zengqiang Shang, Pengyuan Zhang

Published: 2023-09-15 09:59:06+00:00

AI Summary

This paper proposes a one-class knowledge distillation (OCKD) method for spoofing speech detection that addresses the generalization limitations of traditional binary classification approaches. OCKD uses a teacher-student framework, where a teacher model trained on both bonafide and spoofed speech guides a student model trained only on bonafide speech, resulting in improved performance on unseen spoofing attacks.

Abstract

The detection of spoofing speech generated by unseen algorithms remains an unresolved challenge. One reason for the lack of generalization ability is that traditional detection systems follow the binary classification paradigm, which inherently assumes prior knowledge of spoofing speech. One-class methods instead attempt to learn the distribution of bonafide speech and are inherently suited to the task when spoofing speech exhibits significant differences from it. However, training a one-class system using only bonafide speech is challenging. In this paper, we introduce a teacher-student framework to provide guidance for the training of a one-class model. The proposed one-class knowledge distillation method outperforms other state-of-the-art methods on the ASVspoof 21DF and InTheWild datasets, which demonstrates its superior generalization ability.


Key findings
The proposed OCKD method outperforms state-of-the-art methods on the ASVspoof 21DF and InTheWild datasets, demonstrating superior generalization to unseen spoofing attacks. In the knowledge distillation loss, cosine similarity proved more effective than MSE. The overall pooled Equal Error Rate (EER) was reduced from 6.36% to 5.88%.
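
For reference, the Equal Error Rate is the operating point at which the false-acceptance rate (spoofed speech accepted as bonafide) equals the false-rejection rate (bonafide speech rejected). The NumPy sketch below illustrates the standard computation with hypothetical Gaussian score arrays; it is not the evaluation code used in the paper.

```python
import numpy as np

def equal_error_rate(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """EER: sweep a threshold until the false-acceptance rate (spoof scored
    above threshold) meets the false-rejection rate (bonafide scored below).
    Assumes higher scores indicate bonafide speech."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])    # false acceptance
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])  # false rejection
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)

# Toy usage with hypothetical score distributions
rng = np.random.default_rng(0)
bona = rng.normal(1.0, 1.0, 1000)    # hypothetical bonafide scores
spoof = rng.normal(-1.0, 1.0, 1000)  # hypothetical spoof scores
print(f"EER = {100 * equal_error_rate(bona, spoof):.2f}%")
```
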
Approach
The authors address generalization to unseen spoofing attacks with a one-class knowledge distillation approach. A teacher model, trained on both bonafide and spoofed speech, guides a student model that is trained only on bonafide speech; the distillation loss aligns the two models' feature embeddings, with both cosine-similarity and mean-squared-error variants considered. This allows the student model to learn the distribution of bonafide speech and generalize better to unseen attacks.
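
A minimal PyTorch sketch of such a distillation step, under stated assumptions, is shown below. The `teacher`, `student`, and `bonafide_batch` names are placeholders (the paper's models pair a Wav2Vec 2.0 front-end with an AASIST backend), and the training loop is illustrative rather than the authors' exact recipe: the student's embeddings are simply pulled toward the frozen teacher's, using either cosine similarity or MSE.

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_emb: torch.Tensor,
                      student_emb: torch.Tensor,
                      use_cosine: bool = True) -> torch.Tensor:
    """Distillation loss between teacher and student feature embeddings.

    Both tensors are assumed to have shape (batch, embedding_dim). The paper
    reports that the cosine-similarity variant generalizes better than MSE.
    """
    if use_cosine:
        # 1 - cos(teacher, student), averaged over the batch
        return (1.0 - F.cosine_similarity(teacher_emb, student_emb, dim=-1)).mean()
    # mean-squared-error alternative
    return F.mse_loss(student_emb, teacher_emb)

def train_step(teacher, student, bonafide_batch, optimizer):
    """One OCKD step: the student sees only bonafide speech and is aligned
    with the frozen teacher's embeddings (a sketch, not the exact recipe)."""
    teacher.eval()
    with torch.no_grad():
        t_emb = teacher(bonafide_batch)      # (batch, dim)
    s_emb = student(bonafide_batch)          # (batch, dim)
    loss = distillation_loss(t_emb, s_emb, use_cosine=True)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
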
Datasets
ASVspoof 2019 LA (training), ASVspoof 2021 LA (evaluation), ASVspoof 2021 DeepFake (evaluation), ASVspoof 2021 LA hidden track (evaluation), ASVspoof 2021 DF hidden track (evaluation), InTheWild (evaluation)
Model(s)
Teacher Model: Wav2Vec 2.0 (24 Transformer layers) + AASIST backend; Student Model: Wav2Vec 2.0 (8 Transformer layers) + AASIST backend
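
A rough sketch of how the two front-ends could be instantiated is given below, assuming the HuggingFace `transformers` implementation of Wav2Vec 2.0. The checkpoint name and the mean-pooling linear backend are placeholders (the paper uses an AASIST graph-attention backend); truncating the encoder to 8 Transformer layers stands in for the lighter student front-end.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # assumes HuggingFace transformers is installed

class SpoofDetector(nn.Module):
    """Wav2Vec 2.0 front-end + backend classifier. The backend here is a
    simple placeholder for the AASIST network used in the paper."""

    def __init__(self, pretrained_name: str = "facebook/wav2vec2-xls-r-300m",
                 num_layers: int = 24):
        super().__init__()
        self.frontend = Wav2Vec2Model.from_pretrained(pretrained_name)
        # Keep only the first `num_layers` Transformer blocks
        # (24 for the teacher, 8 for the smaller student).
        self.frontend.encoder.layers = self.frontend.encoder.layers[:num_layers]
        self.frontend.config.num_hidden_layers = num_layers
        hidden = self.frontend.config.hidden_size
        # Placeholder backend: mean-pool frame features, then a linear head.
        self.backend = nn.Linear(hidden, 2)

    def forward(self, waveform: torch.Tensor):
        # waveform: (batch, samples) raw audio at 16 kHz
        frames = self.frontend(waveform).last_hidden_state  # (batch, T, hidden)
        embedding = frames.mean(dim=1)                       # utterance-level embedding
        logits = self.backend(embedding)
        return embedding, logits

# teacher = SpoofDetector(num_layers=24)  # trained on bonafide + spoofed speech
# student = SpoofDetector(num_layers=8)   # distilled on bonafide speech only
```
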
Author countries
China