Frequency-mix Knowledge Distillation for Fake Speech Detection

Authors: Cunhang Fan, Shunbo Dong, Jun Xue, Yujie Chen, Jiangyan Yi, Zhao Lv

Published: 2024-06-14 02:25:16+00:00

AI Summary

This paper proposes Frequency-mix knowledge distillation (FKD) for fake speech detection, addressing information loss in existing data augmentation methods. FKD uses a teacher model trained on frequency-mixed data and a student model trained on time-domain augmented data, with multi-level feature distillation to improve information extraction and generalization.

Abstract

In telephony scenarios, the fake speech detection (FSD) task of combating speech spoofing attacks is challenging. Data augmentation (DA) methods are considered an effective means of addressing the FSD task in telephony scenarios, and are typically divided into time-domain and frequency-domain stages. While each has its advantages, both can result in information loss. To tackle this issue, we propose a novel DA method, Frequency-mix (Freqmix), and introduce Freqmix knowledge distillation (FKD) to enhance the model's information extraction and generalization abilities. Specifically, we use Freqmix-enhanced data as input for the teacher model, while the student model's input undergoes a time-domain DA method. We use a multi-level feature distillation approach to restore information and improve the model's generalization capabilities. Our approach achieves state-of-the-art results on the ASVspoof 2021 LA dataset, with a 31% improvement over the baseline, and performs competitively on the ASVspoof 2021 DF dataset.

Key findings

FKD achieved state-of-the-art results on the ASVspoof 2021 LA dataset, showing a 31% improvement over the baseline. The approach also performed competitively on the ASVspoof 2021 DF dataset, outperforming several other single systems and achieving comparable results to complex fusion systems.

Approach

The authors propose Freqmix, a frequency-domain data augmentation method, combined with knowledge distillation. A teacher model uses Freqmix-enhanced data, while a student model uses time-domain augmented data. Multi-level feature distillation transfers knowledge from the teacher to the student model, enhancing performance.
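
To make the idea of multi-level feature distillation concrete, here is a minimal sketch. It is not the authors' implementation: it assumes that features are taken from several intermediate levels of both models, that each level is compared with a mean squared error, and that the weighted sum is added to the student's classification loss. The function names, the per-level weights, and the `alpha` factor are all hypothetical; the summary above does not specify the loss used in the paper.

```python
from typing import Optional, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_level_distill_loss(
    teacher_feats: Sequence[Sequence[float]],
    student_feats: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of per-level MSE terms (assumed form, not from the paper).

    teacher_feats[i] and student_feats[i] are the flattened features taken
    from level i of the teacher and the student, respectively.
    """
    if len(teacher_feats) != len(student_feats):
        raise ValueError("teacher and student must expose the same number of levels")
    if weights is None:
        weights = [1.0] * len(teacher_feats)  # equal weighting by default
    if len(weights) != len(teacher_feats):
        raise ValueError("one weight is required per level")
    return sum(w * mse(t, s) for w, t, s in zip(weights, teacher_feats, student_feats))


def total_loss(cls_loss: float, distill_loss: float, alpha: float = 0.5) -> float:
    """Student objective: classification loss plus scaled distillation loss.

    alpha is a hypothetical trade-off hyperparameter.
    """
    return cls_loss + alpha * distill_loss


# Toy example with two feature levels of different sizes.
teacher = [[1.0, 2.0], [0.0, 0.0, 0.0]]  # e.g. from Freqmix-augmented input
student = [[1.0, 0.0], [1.0, 1.0, 1.0]]  # e.g. from time-domain-augmented input
distill = multi_level_distill_loss(teacher, student)
print(distill, total_loss(0.25, distill))
```

In a real training loop the teacher's features would typically be treated as fixed targets, so that only the student is updated by the distillation term.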

Datasets

ASVspoof 2021 LA and ASVspoof 2021 DF datasets

Model(s)

MPIF-Res2Net

Author countries

China
