Region-Based Optimization in Continual Learning for Audio Deepfake Detection


Authors: Yujie Chen, Jiangyan Yi, Cunhang Fan, Jianhua Tao, Yong Ren, Siding Zeng, Chu Yuan Zhang, Xinrui Yan, Hao Gu, Jun Xue, Chenglong Wang, Zhao Lv, Xiaohui Zhang

Published: 2024-12-16 08:34:09+00:00

AI Summary

This paper introduces RegO, a continual learning method for audio deepfake detection that uses the Fisher information matrix to partition the network's neurons into four regions and applies a different gradient optimization rule to each. Combined with an Ebbinghaus forgetting mechanism that releases redundant neurons from old tasks, this improves the model's ability to adapt to new deepfake audio while retaining performance on previously learned data.

Abstract

Rapid advancements in speech synthesis and voice conversion bring convenience but also new security risks, creating an urgent need for effective audio deepfake detection. Although current models perform well, their effectiveness diminishes when confronted with the diverse and evolving nature of real-world deepfakes. To address this issue, we propose a continual learning method named Region-Based Optimization (RegO) for audio deepfake detection. Specifically, we use the Fisher information matrix to measure important neuron regions for real and fake audio detection, dividing them into four regions. First, we directly fine-tune the less important regions to quickly adapt to new tasks. Next, we apply gradient optimization in parallel for regions important only to real audio detection, and in orthogonal directions for regions important only to fake audio detection. For regions that are important to both, we use sample proportion-based adaptive gradient optimization. This region-adaptive optimization ensures an appropriate trade-off between memory stability and learning plasticity. Additionally, to address the increase of redundant neurons from old tasks, we further introduce the Ebbinghaus forgetting mechanism to release them, thereby promoting the capability of the model to learn more generalized discriminative features. Experimental results show our method achieves a 21.3% improvement in EER over the state-of-the-art continual learning approach RWM for audio deepfake detection. Moreover, the effectiveness of RegO extends beyond the audio deepfake detection domain, showing potential significance in other tasks, such as image recognition. The code is available at https://github.com/cyjie429/RegO

Key findings

RegO achieved a 21.3% improvement in equal error rate (EER) over RWM, the state-of-the-art continual learning approach for audio deepfake detection. The method also showed competitive results in image recognition, suggesting broader applicability. Ablation studies confirmed the effectiveness of the proposed modules.
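For readers unfamiliar with the metric, the following is a minimal sketch of how an equal error rate can be computed from detector scores. It is not taken from the RegO repository. The function names, the score convention (higher means more likely real), and the threshold sweep are all assumptions made for illustration. The summary does not state whether the 21.3% figure is a relative or an absolute change, so `relative_improvement` shows only one possible reading.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumed convention (not from the paper): a higher score means
    "more likely real"; labels are 1 for real audio and 0 for fake.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.unique(scores)  # sorted unique values
    # Add one threshold above the maximum so "reject everything" is covered.
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)
    fake, real = scores[labels == 0], scores[labels == 1]
    # False acceptance: fake audio scored at or above the threshold.
    far = np.array([(fake >= t).mean() for t in thresholds])
    # False rejection: real audio scored below the threshold.
    frr = np.array([(real < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))  # point where the two rates are closest
    return (far[i] + frr[i]) / 2.0

def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction; one possible reading of "improvement"."""
    return (baseline_eer - new_eer) / baseline_eer
```

A lower EER is better, so under the relative reading a positive value from `relative_improvement` means the new method makes fewer errors than the baseline at the operating point where false acceptances and false rejections balance.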

Approach

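RegO uses the Fisher information matrix to measure how important each neuron region is for real and for fake audio detection, which yields four regions. Less important regions are fine-tuned directly so the model adapts quickly to new tasks. Regions important only to real audio detection receive parallel gradient updates, regions important only to fake audio detection receive updates in orthogonal directions, and regions important to both use adaptive gradient optimization based on sample proportion. A neuron forgetting mechanism based on the Ebbinghaus forgetting curve releases redundant neurons left over from old tasks.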
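The four-region scheme can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation. It assumes a diagonal empirical Fisher estimate (mean squared per-sample gradient), a quantile threshold for "important", a hypothetical reference direction `ref_dir` standing in for old-task knowledge, and a simple convex blend weighted by a hypothetical `real_ratio` for the region important to both classes. None of these specifics are given in the summary above.

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """Diagonal empirical Fisher: mean of squared per-sample gradients."""
    g = np.asarray(per_sample_grads, dtype=float)
    return (g ** 2).mean(axis=0)

def partition_regions(fisher_real, fisher_fake, quantile=0.5):
    """Label each parameter with a region id (threshold rule is an assumption).

    0: important to neither, 1: real only, 2: fake only, 3: both.
    """
    real_imp = fisher_real > np.quantile(fisher_real, quantile)
    fake_imp = fisher_fake > np.quantile(fisher_fake, quantile)
    region = np.zeros(fisher_real.shape, dtype=int)
    region[real_imp & ~fake_imp] = 1
    region[~real_imp & fake_imp] = 2
    region[real_imp & fake_imp] = 3
    return region

def project_parallel(g, d, eps=1e-12):
    """Component of g that lies along direction d."""
    return (g @ d) / (d @ d + eps) * d

def region_adaptive_gradient(grad, ref_dir, region, real_ratio):
    """Apply a different update rule per region (illustrative only).

    grad       : flat gradient of the new task
    ref_dir    : hypothetical reference direction from old tasks
    real_ratio : assumed fraction of real samples, used to blend in region 3
    """
    out = grad.astype(float)  # copy; region 0 is left as plain fine-tuning
    for r in (1, 2, 3):
        mask = region == r
        if not mask.any():
            continue
        g, d = grad[mask], ref_dir[mask]
        par = project_parallel(g, d)
        orth = g - par  # remainder is orthogonal to d
        if r == 1:      # real only: keep the parallel component
            out[mask] = par
        elif r == 2:    # fake only: keep the orthogonal component
            out[mask] = orth
        else:           # both: proportion-weighted blend (assumed form)
            out[mask] = real_ratio * par + (1.0 - real_ratio) * orth
    return out
```

In this sketch the projection is computed separately inside each region, so the "fake only" update has zero inner product with `ref_dir` restricted to that region. The official code at the repository linked in the abstract is the reference for how RegO actually defines the regions and directions.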

Datasets

Evolving Deepfake Audio (EVDA) benchmark (including FMFCC, In the Wild, ADD 2022, ASVspoof2015, ASVspoof2019, ASVspoof2021, FoR, and HAD datasets); CLEAR benchmark for the image recognition experiments

Model(s)

Wav2vec 2.0 (XLSR-53) as feature extractor; 5-layer SimpleMlp as backend for audio deepfake detection; ResNet-50 as feature extractor for image recognition.

Author countries

China
