Learning From Yourself: A Self-Distillation Method for Fake Speech Detection

Authors: Jun Xue, Cunhang Fan, Jiangyan Yi, Chenglong Wang, Zhengqi Wen, Dan Zhang, Zhao Lv

Published: 2023-03-02 12:52:22+00:00

AI Summary

This paper introduces a novel self-distillation method for fake speech detection that improves the performance of shallow networks without increasing model complexity. The deepest network serves as a teacher model that guides the shallow networks, reducing the feature differences between them and improving accuracy.

Abstract

In this paper, we propose a novel self-distillation method for fake speech detection (FSD), which can significantly improve FSD performance without increasing model complexity. For FSD, fine-grained information such as spectrogram defects and mute segments is very important, and it is often perceived by shallow networks. However, shallow networks contain much noise and cannot capture this information well. To address this problem, we propose using the deepest network to instruct the shallow networks, thereby enhancing them. Specifically, the FSD network is divided into several segments: the deepest segment is used as the teacher model, and all shallower segments become multiple student models by adding classifiers to them. Meanwhile, distillation paths between the features of the deepest network and those of the shallow networks are used to reduce the feature differences. A series of experiments on the ASVspoof 2019 LA and PA datasets shows the effectiveness of the proposed method, with significant improvements over the baseline.


Key findings

The self-distillation method significantly improves fake speech detection performance over baseline models across different network depths and architectures on both the ASVspoof 2019 LA and PA datasets. The method is robust and generalizes well, achieving state-of-the-art results among single systems.

Approach

The approach divides the FSD network into segments, using the deepest network as a teacher model and the shallower networks as student models. It employs a three-component loss function (hard loss, feature loss, soft loss) during training to guide the student models via knowledge distillation in both the feature and prediction dimensions, as in the sketch below.
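The paper does not include code, so the following is a minimal PyTorch sketch of how such a three-component self-distillation loss is commonly assembled. The concrete choices here (cross-entropy for the hard loss, MSE for the feature loss, temperature-scaled KL divergence for the soft loss), the weights `alpha` and `beta`, the temperature `T`, and all variable names are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, student_feats,
                           teacher_logits, teacher_feat,
                           labels, alpha=0.3, beta=0.1, T=3.0):
    """Three-component self-distillation loss (illustrative sketch).

    student_logits: list of [B, C] logits from the shallow classifiers
    student_feats:  list of [B, D] features from the shallow segments,
                    assumed already projected to the teacher's feature shape
    teacher_logits: [B, C] logits from the deepest segment (the teacher)
    teacher_feat:   [B, D] feature from the deepest segment
    labels:         [B] ground-truth bona fide / spoof labels
    """
    # Hard loss: the teacher classifier fits the ground-truth labels.
    loss = F.cross_entropy(teacher_logits, labels)
    for s_logits, s_feat in zip(student_logits, student_feats):
        # Hard loss: each student classifier also fits the labels.
        loss = loss + F.cross_entropy(s_logits, labels)
        # Feature loss: pull the shallow feature toward the deepest feature.
        loss = loss + beta * F.mse_loss(s_feat, teacher_feat.detach())
        # Soft loss: temperature-scaled KL to the teacher's predictions.
        soft_t = F.log_softmax(teacher_logits.detach() / T, dim=-1)
        soft_s = F.log_softmax(s_logits / T, dim=-1)
        loss = loss + alpha * (T * T) * F.kl_div(
            soft_s, soft_t, log_target=True, reduction="batchmean")
    return loss
```

In this kind of setup, `alpha` and `beta` trade off imitation of the teacher against fitting the labels, and the teacher's outputs are detached so that only the students are pulled toward the deepest segment. Since the extra classifiers and distillation paths are used only during training, inference cost is unchanged, which matches the paper's claim of improving FSD without increasing model complexity.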
Datasets

ASVspoof 2019 LA and PA datasets

Model(s)

ECANet and SENet architectures

Author countries

China