Does Current Deepfake Audio Detection Model Effectively Detect ALM-based Deepfake Audio?

Authors: Yuankun Xie, Chenxu Xiong, Xiaopeng Wang, Zhiyong Wang, Yi Lu, Xin Qi, Ruibo Fu, Yukun Liu, Zhengqi Wen, Jianhua Tao, Guanjun Li, Long Ye

Published: 2024-08-20 13:45:34+00:00

AI Summary

This paper investigates the effectiveness of current deepfake audio detection models against audio generated by Audio Language Models (ALMs). The study evaluates state-of-the-art countermeasures on 12 types of ALM-generated audio, finding that codec-trained countermeasures achieve surprisingly high detection accuracy, exceeding expectations.

Abstract

Currently, Audio Language Models (ALMs) are rapidly advancing due to developments in large language models and audio neural codecs. These ALMs have significantly lowered the barrier to creating deepfake audio, generating highly realistic and diverse types of deepfake audio that pose severe threats to society. Consequently, effective audio deepfake detection technologies for ALM-based audio have become increasingly critical. This paper investigates the effectiveness of current countermeasures (CMs) against ALM-based audio. Specifically, we collect 12 types of the latest ALM-based deepfake audio and use the latest CMs to evaluate them. Our findings reveal that the latest codec-trained CM can effectively detect ALM-based audio, achieving a 0% equal error rate under most ALM test conditions, which exceeded our expectations. This indicates promising directions for future research in ALM-based deepfake audio detection.


Key findings
Codec-trained countermeasures achieved significantly better performance (mostly 0% EER) on ALM-generated audio than vocoder-trained countermeasures. However, even the codec-trained models showed higher false-negative rates on certain audio types, suggesting limited generalization to diverse audio types and real-world scenarios.
Approach
The researchers evaluated existing audio deepfake detection countermeasures (CMs) on a newly collected dataset of 12 different types of ALM-generated audio. They compared the performance of CMs trained on traditional vocoder-based datasets and those trained on a codec-based dataset (Codecfake). The equal error rate (EER) was used as the evaluation metric.
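The EER used above is the operating point where the false-acceptance rate (spoof audio accepted as bona fide) equals the false-rejection rate (bona fide audio rejected). A minimal score-sweep sketch of the metric is shown below; this is an illustrative implementation assuming higher scores mean "more likely bona fide", not the authors' evaluation code.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate: the point where the false-acceptance rate
    (FAR) equals the false-rejection rate (FRR), assuming a higher
    score indicates bona fide audio."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones_like(bonafide_scores),
                             np.zeros_like(spoof_scores)])
    # Sweep every observed score as a candidate decision threshold.
    order = np.argsort(scores)
    scores, labels = scores[order], labels[order]
    # FRR at threshold i: fraction of bona fide scores <= scores[i].
    frr = np.cumsum(labels) / labels.sum()
    # FAR at threshold i: fraction of spoof scores > scores[i].
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()
    # EER is where the two curves cross (average at the nearest point).
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0
```

A perfectly separating detector (as the codec-trained CMs were on most ALM conditions) yields an EER of 0%, e.g. `compute_eer(np.array([0.9, 0.8, 0.7]), np.array([0.1, 0.2, 0.3]))` returns `0.0`.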
Datasets
ASVspoof2019LA, Codecfake, 12 types of ALM-based deepfake audio collected from demo pages of various ALM models (A01-A12), in-the-wild (ITW) dataset.
Model(s)
LCNN, AASIST, Wav2vec2-xls-r (features extracted, model frozen).
Author countries
China