Freeze and Learn: Continual Learning with Selective Freezing for Speech Deepfake Detection

Authors: Davide Salvi, Viola Negroni, Luca Bondi, Paolo Bestagini, Stefano Tubaro

Published: 2024-09-26 07:27:51+00:00

AI Summary

This paper investigates how continual learning is best applied to speech deepfake detection. It compares retraining an entire model with selectively updating only its initial layers (those responsible for processing the input features) while freezing the others. Results show that selectively updating the initial layers is the most effective strategy for adapting to new data while maintaining performance on previously seen data.
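
The comparison boils down to choosing which of the detector's two modules receives gradient updates. The PyTorch sketch below illustrates this selective freezing; the `Detector` class and `set_trainable` helper are hypothetical names introduced here for illustration, not the authors' code.

```python
import torch.nn as nn

# Hypothetical detector split into the two modules discussed in the paper:
# an encoder that processes the input features and a classifier head.
class Detector(nn.Module):
    def __init__(self, encoder: nn.Module, classifier: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.classifier = classifier

    def forward(self, x):
        return self.classifier(self.encoder(x))

def set_trainable(model: Detector, strategy: str) -> None:
    """Select which module is updated when retraining on a new dataset.

    strategy: 'full'       -> update encoder and classifier
              'encoder'    -> update encoder only (classifier frozen)
              'classifier' -> update classifier only (encoder frozen)
    """
    train_encoder = strategy in ("full", "encoder")
    train_classifier = strategy in ("full", "classifier")
    for p in model.encoder.parameters():
        p.requires_grad = train_encoder
    for p in model.classifier.parameters():
        p.requires_grad = train_classifier
```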

Abstract

In speech deepfake detection, one of the critical aspects is developing detectors able to generalize on unseen data and distinguish fake signals across different datasets. Common approaches to this challenge involve incorporating diverse data into the training process or fine-tuning models on unseen datasets. However, these solutions can be computationally demanding and may lead to the loss of knowledge acquired from previously learned data. Continual learning techniques offer a potential solution to this problem, allowing the models to learn from unseen data without losing what they have already learned. Still, the optimal way to apply these algorithms to speech deepfake detection models remains unclear. In this paper, we address this aspect and investigate whether, when retraining a speech deepfake detector, it is more effective to apply continual learning across the entire model or to update only some of its layers while freezing others. Our findings, validated across multiple models, indicate that the most effective approach among the analyzed ones is to update only the weights of the initial layers, which are responsible for processing the input features of the detector.


Key findings
Updating only the initial layers (the encoder) during continual learning is significantly more effective than retraining the entire model or only the classifier: it minimizes catastrophic forgetting, maintaining higher accuracy on previously seen datasets while adapting to new ones. The Train-on-All approach, which jointly trains on all datasets at once, achieved the highest overall performance and serves as an upper-bound benchmark rather than a continual learning strategy.
Approach
The authors employ continual learning with selective freezing. They divide a speech deepfake detector into an encoder module (which processes the input features) and a classifier module. When retraining on new datasets with the DFWF continual learning method, they compare updating both modules, only the encoder, and only the classifier.
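
As a rough sketch of what one such retraining step could look like, the snippet below fine-tunes the detector on a new dataset while a frozen copy of the previous model supplies a distillation penalty against forgetting. This is a generic LwF-style stand-in for the regularization used by DFWF (which additionally relies on a positive-sample alignment term); the function name, data loader format, and `lambda_kd` weight are assumptions, not the paper's implementation.

```python
import copy
import torch
import torch.nn.functional as F

def continual_update(model, loader, optimizer, lambda_kd=0.5, device="cpu"):
    """Distillation-regularized fine-tuning on one new dataset.

    Assumes the desired module (e.g. only the encoder) was left trainable
    beforehand; the frozen snapshot of the previous model provides soft
    targets that discourage drifting away from earlier knowledge.
    """
    old_model = copy.deepcopy(model).to(device).eval()  # previous-knowledge snapshot
    for p in old_model.parameters():
        p.requires_grad = False

    model.to(device).train()
    for audio, labels in loader:                 # loader yields (waveform batch, labels)
        audio, labels = audio.to(device), labels.to(device)

        logits = model(audio)
        loss_new = F.cross_entropy(logits, labels)       # fit the new dataset

        with torch.no_grad():
            old_logits = old_model(audio)                # soft targets from the old model
        loss_kd = F.kl_div(
            F.log_softmax(logits, dim=-1),
            F.softmax(old_logits, dim=-1),
            reduction="batchmean",
        )                                                # penalize forgetting

        loss = loss_new + lambda_kd * loss_kd
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```
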
Datasets
ASVspoof 2019, FakeOrReal, In-the-Wild, Purdue speech dataset
Model(s)
RawNet2, LCNN
Author countries
Italy, USA