Naturalness-Aware Curriculum Learning with Dynamic Temperature for Speech Deepfake Detection

Authors: Taewoo Kim, Guisik Kim, Choongsang Cho, Young Han Lee

Published: 2025-05-20 06:15:17+00:00

AI Summary

This research proposes naturalness-aware curriculum learning, a training framework for speech deepfake detection that uses speech naturalness, measured by mean opinion scores (MOS), to improve model robustness and generalization. The approach also applies dynamic temperature scaling based on speech naturalness, and achieves a 23% relative reduction in equal error rate (EER) on the ASVspoof 2021 DF dataset.

Abstract

Recent advances in speech deepfake detection (SDD) have significantly improved artifacts-based detection in spoofed speech. However, most models overlook speech naturalness, a crucial cue for distinguishing bona fide speech from spoofed speech. This study proposes naturalness-aware curriculum learning, a novel training framework that leverages speech naturalness to enhance the robustness and generalization of SDD. This approach measures sample difficulty using both ground-truth labels and mean opinion scores, and adjusts the training schedule to progressively introduce more challenging samples. To further improve generalization, a dynamic temperature scaling method based on speech naturalness is incorporated into the training process. A 23% relative reduction in the EER was achieved in the experiments on the ASVspoof 2021 DF dataset, without modifying the model architecture. Ablation studies confirmed the effectiveness of naturalness-aware training strategies for SDD tasks.


Key findings
The proposed naturalness-aware curriculum learning with dynamic temperature scaling achieved a 23% relative reduction in equal error rate (EER) on the ASVspoof 2021 DF dataset without modifying the model architecture. The method achieved the lowest EER among state-of-the-art models on the ASVspoof 2021 LA and DF datasets and on the In-The-Wild dataset. Ablation studies confirmed that both curriculum learning and dynamic temperature scaling are effective.
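For readers unfamiliar with the metric, the sketch below shows one simple way to compute an EER from detector scores and how a relative reduction is derived from two EER values. It is an illustration only, not the evaluation code used in the paper: the function names, the threshold sweep, and the convention that a higher score means "more likely bona fide" are all assumptions.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: a higher score means the detector considers the sample
    more likely to be bona fide.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection rate: bona fide samples scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed samples scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # The EER is the operating point where the two rates meet.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Fraction by which new_eer is lower than baseline_eer."""
    return (baseline_eer - new_eer) / baseline_eer
```

With this definition, a 23% relative reduction means the new EER is 0.77 times the baseline EER, whatever the baseline's absolute value is.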
Approach
The authors introduce a naturalness-aware curriculum learning framework. It assesses sample difficulty from ground-truth labels and mean opinion scores (MOS), and progressively introduces harder samples during training. A dynamic temperature scaling method, also based on MOS, further improves generalization by adjusting the softmax temperature.
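The following is a minimal sketch of how such a scheme could be wired together, not the authors' implementation. The summary does not give the exact formulas, so everything here is an assumption made for illustration: the 1 to 5 MOS scale, the rule that natural-sounding spoofs and unnatural-sounding bona fide speech count as hard, the linear pacing schedule, the direction and range of the temperature mapping, and all function and parameter names.

```python
import math

MOS_MIN, MOS_MAX = 1.0, 5.0  # assumed MOS scale


def difficulty(label, mos):
    """Hypothetical difficulty in [0, 1]; label 1 = bona fide, 0 = spoof."""
    naturalness = (mos - MOS_MIN) / (MOS_MAX - MOS_MIN)
    # Assumption: a spoof that sounds natural is hard, and so is
    # bona fide speech that sounds unnatural.
    return 1.0 - naturalness if label == 1 else naturalness


def curriculum_subset(samples, epoch, total_epochs, start_fraction=0.3):
    """Linear pacing: keep the easiest samples, growing to all of them."""
    progress = epoch / max(total_epochs - 1, 1)
    fraction = min(1.0, start_fraction + (1.0 - start_fraction) * progress)
    ranked = sorted(samples, key=lambda s: difficulty(s["label"], s["mos"]))
    keep = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:keep]


def dynamic_temperature(label, mos, t_min=0.5, t_max=2.0):
    """Hypothetical mapping: harder samples get a higher temperature."""
    return t_min + (t_max - t_min) * difficulty(label, mos)


def softmax_with_temperature(logits, temperature):
    """Numerically stable softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

In a training loop, `curriculum_subset` would select the samples for each epoch, and each sample's logits would pass through `softmax_with_temperature` with its own `dynamic_temperature` value before the loss is computed. A higher temperature flattens the output distribution, so under the assumed mapping the model is pushed less strongly toward confident predictions on ambiguous samples.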
Datasets
ASVspoof 2019 logical access (LA) dataset, ASVspoof 2021 LA and DeepFake (DF) datasets, In-The-Wild dataset
Model(s)
XLS-R AASIST, XLS-R Transformer, XLS-R Conformer (Conformer-based model with pre-trained XLS-R 300M model for feature extraction)
Author countries
South Korea