Continuous Learning of Transformer-based Audio Deepfake Detection

Authors: Tuan Duy Nguyen Le, Kah Kuan Teh, Huy Dat Tran

Published: 2024-09-09 08:28:09+00:00

AI Summary

This paper presents a framework for audio deepfake detection that achieves high accuracy on existing data and adapts effectively to new fake data via continuous learning. It uses an Audio Spectrogram Transformer (AST) model, enhanced with data augmentation and a continuous learning plugin module that outperforms conventional fine-tuning.

Abstract

This paper proposes a novel framework for audio deepfake detection with two main objectives: i) attaining the highest possible accuracy on available fake data, and ii) effectively performing continuous learning on new fake data in a few-shot learning manner. Specifically, we build a large audio deepfake collection using various deep audio generation methods. The data is further enhanced with additional augmentation methods to increase variation across compression, far-field recording, noise, and other distortions. We then adopt the Audio Spectrogram Transformer as the audio deepfake detection model. The proposed method achieves promising performance on various benchmark datasets. Furthermore, we present a continuous learning plugin module that updates the trained model as effectively as possible with the fewest possible labeled data points of the new fake type. The proposed method outperforms the conventional direct fine-tuning approach with far fewer labeled data points.


Key findings
The proposed AST model achieved state-of-the-art performance on several benchmark datasets. On unseen datasets, the continuous learning plugin raised AUC from just over 70% to more than 95% using only a small fraction (0.1%) of the new data, outperforming direct fine-tuning.
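Since the results above are reported as AUC, a minimal sketch of how that metric can be computed may help. It uses the pairwise-ranking definition: the probability that a randomly chosen fake clip receives a higher score than a randomly chosen real clip. The function name `roc_auc` and the toy scores are illustrative assumptions, not values or code from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (fake, real) pairs ranked correctly; ties count half."""
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    if not fake or not real:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for f in fake:
        for r in real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fake) * len(real))


# Toy detector scores (made up for illustration): 1 = fake, 0 = real.
labels = [1, 1, 1, 0, 0, 0]
scores_before = [0.9, 0.4, 0.3, 0.5, 0.2, 0.1]  # hypothetical scores before adaptation
scores_after = [0.9, 0.8, 0.6, 0.5, 0.2, 0.1]   # hypothetical scores after adaptation

print(f"AUC before: {roc_auc(labels, scores_before):.2f}")
print(f"AUC after:  {roc_auc(labels, scores_after):.2f}")
```

The double loop is quadratic in the number of clips, which is fine for a sketch; a rank-based implementation would be preferable for full evaluation sets.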
Approach
The authors use an Audio Spectrogram Transformer (AST) for deepfake detection and improve its robustness through data augmentation. A continuous learning plugin that incorporates gradient boosting enables efficient updates with minimal labeled data of new fake types: it first applies discriminative learning and then fine-tunes the model on the accumulated data.
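The plugin idea described above can be sketched as a small gradient-boosted classifier trained on a handful of labeled embeddings while the backbone stays frozen. This is only an illustration under stated assumptions: the paper uses XGBoost, whereas this sketch substitutes a tiny pure-Python booster built from decision stumps, and the two-dimensional "embeddings", the function names, and the hyperparameters are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Pick the single-feature split that best fits the residuals (squared error)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, thr, lv, rv)
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best[1:]


def stump_value(stump, row):
    j, thr, lv, rv = stump
    return lv if row[j] <= thr else rv


def fit_plugin(X, y, n_rounds=20, lr=0.5):
    """Gradient boosting with logistic loss: each stump fits y - sigmoid(logit)."""
    stumps, logits = [], [0.0] * len(X)
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(f) for yi, f in zip(y, logits)]
        j, thr, lv, rv = fit_stump(X, residuals)
        stump = (j, thr, lr * lv, lr * rv)  # fold the learning rate into the leaves
        stumps.append(stump)
        logits = [f + stump_value(stump, row) for f, row in zip(logits, X)]
    return stumps


def plugin_score(stumps, row):
    """Probability that the clip behind this embedding is fake."""
    return sigmoid(sum(stump_value(s, row) for s in stumps))


# Hypothetical few-shot set: toy 2-D stand-ins for frozen AST embeddings.
few_shot_X = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.8, 0.2], [0.9, 0.3], [0.7, 0.1]]
few_shot_y = [0, 0, 0, 1, 1, 1]  # 0 = real, 1 = new fake type

plugin = fit_plugin(few_shot_X, few_shot_y)
print(plugin_score(plugin, [0.85, 0.15]), plugin_score(plugin, [0.15, 0.85]))
```

Training only a light classifier on fixed embeddings is one plausible reason such a plugin can adapt from very few labels, since none of the transformer's weights need to move; the later fine-tuning stage on accumulated data is not shown here.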
Datasets
ASVspoof 2019 LA evaluation dataset, FakeAVCeleb dataset, In-the-wild dataset, ASVspoof 2021 (Logical Access and DeepFake parts), Fake or Real (FoR) dataset.
Model(s)
Audio Spectrogram Transformer (AST), XGBoost
Author countries
Vietnam, Singapore