One-Class Learning with Adaptive Centroid Shift for Audio Deepfake Detection

Authors: Hyun Myung Kim, Kangwook Jang, Hoirin Kim

Published: 2024-06-24 15:21:50+00:00

AI Summary

This paper proposes a novel adaptive centroid shift (ACS) method for audio deepfake detection using one-class learning. ACS updates the centroid representation using only bonafide samples, creating a robust model against unseen spoofing attacks. The method achieves a state-of-the-art equal error rate (EER) of 2.19% on the ASVspoof 2021 deepfake dataset.

Abstract

As speech synthesis systems continue to make remarkable advances in recent years, the importance of robust deepfake detection systems that perform well in unseen systems has grown. In this paper, we propose a novel adaptive centroid shift (ACS) method that updates the centroid representation by continually shifting as the weighted average of bonafide representations. Our approach uses only bonafide samples to define their centroid, which can yield a specialized centroid for one-class learning. Integrating our ACS with one-class learning gathers bonafide representations into a single cluster, forming well-separated embeddings robust to unseen spoofing attacks. Our proposed method achieves an equal error rate (EER) of 2.19% on the ASVspoof 2021 deepfake dataset, outperforming all existing systems. Furthermore, the t-SNE visualization illustrates that our method effectively maps the bonafide embeddings into a single cluster and successfully disentangles the bonafide and spoof classes.

Key findings

The proposed method achieves an EER of 2.19% on the ASVspoof 2021 deepfake (DF) dataset, which the authors report as better than all existing systems. t-SNE visualizations show the bonafide embeddings forming a single cluster that is clearly separated from the spoofed embeddings. The one-class learning approach is reported to outperform binary classification for this task.
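
For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoof accepted as bonafide) equals the false rejection rate (bonafide rejected as spoof). The following is a minimal sketch of a threshold-sweep approximation, assuming higher scores mean "more bonafide". The function name `equal_error_rate` and the toy scores are illustrative and do not come from the paper or from the ASVspoof evaluation toolkit.

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER, assuming higher scores mean 'more bonafide'."""
    bonafide_scores = np.asarray(bonafide_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    # Use every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # False rejection rate: bonafide samples scored below the threshold.
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # False acceptance rate: spoof samples scored at or above the threshold.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Toy scores (made up for illustration only).
print(equal_error_rate([0.9, 0.8], [0.1, 0.2]))
```

On real evaluation sets the two rates rarely cross exactly at an observed score, so official tooling typically interpolates between thresholds; this sketch simply takes the closest point.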

Approach

The authors combine one-class learning with an adaptive centroid shift (ACS) method. ACS continually shifts the centroid toward a weighted average of bonafide representations only, so spoofed samples do not influence it. This specialized centroid is then used in a one-class loss function to separate bonafide and spoofed audio embeddings.
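
The centroid idea can be illustrated with a small sketch. This is not the authors' implementation: the summary does not give the exact weighting scheme or loss, so the code below assumes a count-weighted running average of L2-normalized bonafide embeddings and a cosine-similarity score against the centroid. The class name `RunningCentroid`, the label convention (1 = bonafide, 0 = spoof), and the toy embeddings are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

class RunningCentroid:
    """Hypothetical centroid built as a running average of bonafide embeddings."""

    def __init__(self, dim):
        self.centroid = np.zeros(dim)
        self.count = 0  # number of bonafide embeddings seen so far

    def update(self, embeddings, labels):
        # Assumed label convention: 1 = bonafide, 0 = spoof.
        labels = np.asarray(labels)
        bonafide = l2_normalize(np.asarray(embeddings, dtype=float)[labels == 1])
        n = len(bonafide)
        if n == 0:
            return self.centroid  # spoof-only batch: centroid is left untouched
        total = self.count + n
        # Weighted average of the old centroid and the new bonafide embeddings,
        # with weights proportional to how many samples each side represents.
        self.centroid = (self.count * self.centroid + bonafide.sum(axis=0)) / total
        self.count = total
        return self.centroid

    def score(self, embeddings):
        # Cosine similarity to the centroid; higher means "closer to bonafide".
        return l2_normalize(np.asarray(embeddings, dtype=float)) @ l2_normalize(self.centroid)

# Toy batch: two bonafide embeddings and one spoof embedding.
batch = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
tracker = RunningCentroid(dim=2)
tracker.update(batch, labels=[1, 1, 0])
print(tracker.centroid, tracker.score(batch))
```

In a training loop, a score of this kind would feed a one-class loss that pulls bonafide embeddings toward the centroid and pushes spoofed ones away; the specific loss used in the paper is not described in this summary.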

Datasets

ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF

Model(s)

XLS-R (pre-trained speech foundation model) with Attentive Statistics Pooling (ASP)
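
As a rough illustration of the pooling step, attentive statistics pooling collapses a variable-length sequence of frame-level features into one fixed-size vector by concatenating an attention-weighted mean and standard deviation. The sketch below shows the general technique only; the parameter names `w`, `v`, and `b`, the hidden size, and the random inputs are assumptions, and the configuration used in the paper is not given in this summary.

```python
import numpy as np

def attentive_statistics_pooling(frames, w, v, b=0.0):
    """Pool frame features (T, D) into a single (2*D,) vector.

    w: (D, H) projection, v: (H,) scoring vector, b: scalar or (H,) bias.
    These parameters would normally be learned; here they are passed in.
    """
    hidden = np.tanh(frames @ w + b)            # (T, H) per-frame hidden state
    logits = hidden @ v                         # (T,) one attention score per frame
    logits = logits - logits.max()              # stabilize the softmax
    alpha = np.exp(logits) / np.exp(logits).sum()  # attention weights, sum to 1
    mean = alpha @ frames                       # (D,) weighted mean
    var = alpha @ (frames ** 2) - mean ** 2     # (D,) weighted variance
    std = np.sqrt(np.clip(var, 1e-12, None))    # clip guards against tiny negatives
    return np.concatenate([mean, std])

# Toy input: 5 frames of 4-dimensional features with random parameters.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 4))
pooled = attentive_statistics_pooling(frames, rng.normal(size=(4, 3)), rng.normal(size=3))
print(pooled.shape)
```

With all-zero attention parameters the weights become uniform, and the output reduces to the plain per-dimension mean and standard deviation of the frames.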

Author countries

South Korea
