One-Class Learning with Adaptive Centroid Shift for Audio Deepfake Detection

Authors: Hyun Myung Kim, Kangwook Jang, Hoirin Kim

Published: 2024-06-24 15:21:50+00:00

Comment: Accepted by Interspeech 2024

AI Summary

This paper introduces Adaptive Centroid Shift (ACS), a method for one-class learning in audio deepfake detection. ACS continually updates a bonafide centroid using only genuine speech samples, so that authentic audio forms a tight cluster while spoofed audio is pushed away from it. The authors report that this improves generalization to unseen deepfake attacks, reaching an equal error rate (EER) of 2.19% on the ASVspoof 2021 deepfake dataset.

Abstract

As speech synthesis systems continue to make remarkable advances in recent years, the importance of robust deepfake detection systems that perform well in unseen systems has grown. In this paper, we propose a novel adaptive centroid shift (ACS) method that updates the centroid representation by continually shifting as the weighted average of bonafide representations. Our approach uses only bonafide samples to define their centroid, which can yield a specialized centroid for one-class learning. Integrating our ACS with one-class learning gathers bonafide representations into a single cluster, forming well-separated embeddings robust to unseen spoofing attacks. Our proposed method achieves an equal error rate (EER) of 2.19% on the ASVspoof 2021 deepfake dataset, outperforming all existing systems. Furthermore, the t-SNE visualization illustrates that our method effectively maps the bonafide embeddings into a single cluster and successfully disentangles the bonafide and spoof classes.


Key findings
The proposed method achieved an Equal Error Rate (EER) of 2.19% on the ASVspoof 2021 DF dataset and 0.17% on the ASVspoof 2019 LA dataset, the lowest among the systems it was compared against. On ASVspoof 2021 LA it was competitive, with an EER of 1.30%. t-SNE visualizations further showed that ACS maps bonafide embeddings into a single cluster that is well separated from spoofed embeddings.
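For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoofed audio accepted as bonafide) equals the false rejection rate (bonafide audio rejected). The following is a minimal sketch that approximates it with a simple threshold sweep. It is not the official ASVspoof scoring tool, and the function name `equal_error_rate`, the convention that higher scores mean "more bonafide", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: a higher score means the system considers the sample
    more likely to be bonafide. A sample is accepted when score >= threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: bonafide samples that fall below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed samples at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical, not taken from the paper).
overlapping = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
separated = equal_error_rate([0.9, 0.8], [0.2, 0.1])
```

In the first toy case one bonafide sample scores below one spoofed sample, so both error rates meet at 25%. In the second case the classes are perfectly separated and the EER is 0.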
Approach
The method uses a pre-trained XLS-R model as a feature encoder and Attentive Statistics Pooling (ASP) to derive utterance-level embeddings. On top of these embeddings, a one-class learning framework applies the Adaptive Centroid Shift (ACS) mechanism, which updates a centroid vector as a weighted average of bonafide representations only. A cosine distance-based loss function pulls bonafide samples toward this specialized centroid while pushing spoofed samples away.
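The centroid update and the distance-based loss can be illustrated with a short sketch. It rests on several assumptions that this summary does not confirm: the centroid is kept as a count-weighted running mean of L2-normalized bonafide embeddings, the loss is a hinge on cosine distance, and the names `AdaptiveCentroid`, `one_class_loss`, and `margin` are hypothetical. The paper's exact weighting and loss formulation may differ.

```python
import math


def normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))


class AdaptiveCentroid:
    """Running centroid built from bonafide embeddings only (assumed scheme)."""

    def __init__(self, dim):
        self.centroid = [0.0] * dim
        self.count = 0  # number of bonafide samples seen so far

    def update(self, embeddings, labels):
        # Spoofed samples (label 0) never contribute to the centroid.
        bona = [normalize(e) for e, y in zip(embeddings, labels) if y == 1]
        if not bona:
            return self.centroid
        total = self.count + len(bona)
        for d in range(len(self.centroid)):
            batch_sum = sum(e[d] for e in bona)
            # Weighted average of the old centroid and the new bonafide batch.
            self.centroid[d] = (self.count * self.centroid[d] + batch_sum) / total
        self.count = total
        return self.centroid


def one_class_loss(embeddings, labels, centroid, margin=0.5):
    """Hypothetical loss: pull bonafide in, push spoof beyond a margin."""
    losses = []
    for e, y in zip(embeddings, labels):
        dist = 1.0 - cosine(e, centroid)  # cosine distance in [0, 2]
        losses.append(dist if y == 1 else max(0.0, margin - dist))
    return sum(losses) / len(losses)


# Toy 2-D batch: two bonafide embeddings and one spoofed embedding.
acs = AdaptiveCentroid(dim=2)
batch = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
labels = [1, 1, 0]
first_centroid = list(acs.update(batch, labels))
loss = one_class_loss(batch, labels, acs.centroid)
second_centroid = list(acs.update([[1.0, 0.0], [0.0, -1.0]], [1, 0]))
```

After the first batch the centroid sits midway between the two bonafide embeddings, and the spoofed embedding is already farther away than the margin, so it contributes no loss. The second update shifts the centroid toward the new bonafide sample and ignores the spoofed one, which is the behavior the one-class setup relies on.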
Datasets
ASVspoof 2019 LA (training and validation); evaluation on ASVspoof 2019 LA (19LA), ASVspoof 2021 LA (21LA), and ASVspoof 2021 DF (21DF)
Model(s)
XLS-R (feature encoder), Attentive Statistics Pooling (ASP), One-class learning with Adaptive Centroid Shift (ACS)
Author countries
South Korea