Towards Robust Audio Deepfake Detection: An Evolving Benchmark for Continual Learning

Authors: Xiaohui Zhang, Jiangyan Yi, Jianhua Tao

Published: 2024-05-14 13:37:13+00:00

AI Summary

This paper introduces EVDA, a benchmark for evaluating continual learning methods in deepfake audio detection. EVDA addresses the challenge of traditional methods struggling to adapt to evolving synthetic speech by incorporating classic and newly generated deepfake audio datasets and supporting various continual learning techniques.

Abstract

The rise of advanced large language models such as GPT-4, GPT-4o, and the Claude family has made fake audio detection increasingly challenging. Traditional fine-tuning methods struggle to keep pace with the evolving landscape of synthetic speech, necessitating continual learning approaches that can adapt to new audio while retaining the ability to detect older types. Continual learning is an effective tool for detecting newly emerged deepfake audio while maintaining performance on older types, but it lacks a well-constructed and user-friendly evaluation framework. To address this gap, we introduce EVDA, a benchmark for evaluating continual learning methods in deepfake audio detection. EVDA includes classic datasets from the Anti-Spoofing Voice series, Chinese fake audio detection series, and newly generated deepfake audio from models like GPT-4 and GPT-4o. It supports various continual learning techniques, such as Elastic Weight Consolidation (EWC), Learning without Forgetting (LwF), and recent methods like Regularized Adaptive Weight Modification (RAWM) and Radian Weight Modification (RWM). Additionally, EVDA facilitates the development of robust algorithms by providing an open interface for integrating new continual learning methods.


Key findings
Replay and EWC consistently demonstrated the most competitive performance across all tasks in EVDA, achieving the lowest average equal error rate (EER). Other methods showed varying performance, highlighting the importance of selecting appropriate continual learning strategies for robust deepfake audio detection. EWC performed better on older tasks, showing its effectiveness in mitigating catastrophic forgetting.
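The benchmark's headline metric is the EER, the operating point where the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be computed from detection scores (this is an illustrative implementation, not code from the paper; higher score is assumed to mean "more likely bona fide"):

```python
def eer(bona_scores, spoof_scores):
    """Equal error rate: sweep thresholds and find the point where the
    false acceptance rate (spoof accepted as bona fide) is closest to
    the false rejection rate (bona fide rejected)."""
    thresholds = sorted(set(bona_scores) | set(spoof_scores))
    best = (float("inf"), 1.0)  # (|FAR - FRR|, midpoint estimate)
    for t in thresholds:
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bona_scores) / len(bona_scores)
        best = min(best, (abs(far - frr), (far + frr) / 2))
    return best[1]
```

In practice, toolkits interpolate between thresholds on the ROC curve for a smoother estimate; the coarse sweep above conveys the idea.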
Approach
The authors address the problem of evolving deepfake audio detection by creating EVDA, a benchmark that includes datasets from various sources and supports multiple continual learning methods like EWC, LwF, RAWM, and RWM. This allows for evaluating the robustness and adaptability of different continual learning approaches to new audio deepfakes while maintaining detection of older ones.
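Among the supported methods, EWC illustrates the regularization family: after training on an old task, each parameter is anchored near its old value in proportion to its (diagonal) Fisher information. A hedged sketch of the penalty term, assuming precomputed Fisher estimates (illustrative only, not the paper's implementation):

```python
import numpy as np

def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Elastic Weight Consolidation regularizer:
        (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2
    Parameters with high Fisher importance F_i are pulled strongly
    toward their old-task values, mitigating forgetting; unimportant
    parameters remain free to adapt to the new task."""
    return 0.5 * lam * float(np.sum(fisher * (params - old_params) ** 2))
```

During training on a new deepfake-generation task, this penalty is added to the task loss before backpropagation.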
Datasets
Anti-Spoofing Voice series, Chinese fake audio detection series (ADD 2022, ADD 2023), datasets generated using GPT-4 and GPT-4o, FMFCC-A, In-the-Wild, ASVspoof2015, ASVspoof2019LA, ASVspoof2021LA, FoR, HAD
Model(s)
A 5-layer fully-connected (linear) model with a hidden dimension of 128
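A minimal sketch of such a backbone, using NumPy for self-containment; the input/output dimensions, ReLU activations, and weight initialization are assumptions for illustration, since the paper only specifies "5 linear layers, hidden dimension 128":

```python
import numpy as np

def build_mlp(in_dim, hidden=128, out_dim=2, n_layers=5, seed=0):
    """Build weights for a 5-linear-layer classifier:
    in_dim -> 128 -> 128 -> 128 -> 128 -> out_dim (2 classes:
    bona fide vs. spoof). Layer sizes are an illustrative guess."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden] * (n_layers - 1) + [out_dim]
    return [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(layers, x):
    # ReLU between hidden layers (assumed), linear output logits
    for W, b in layers[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = layers[-1]
    return x @ W + b
```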
Author countries
China