Vision Graph Non-Contrastive Learning for Audio Deepfake Detection with Limited Labels

Authors: Falih Gozi Febrinanto, Kristen Moore, Chandra Thapa, Jiangang Ma, Vidya Saikrishna, Feng Xia

Published: 2025-01-09 03:18:27+00:00

AI Summary

This paper introduces SIGNL, a novel framework for audio deepfake detection that uses spatio-temporal vision graph non-contrastive learning to achieve high performance with limited labeled data. SIGNL constructs graphs from audio spectrograms, pre-trains encoders using label-free learning, and fine-tunes them for deepfake detection, significantly outperforming state-of-the-art methods.

Abstract

Recent advancements in audio deepfake detection have leveraged graph neural networks (GNNs) to model frequency and temporal interdependencies in audio data, effectively identifying deepfake artifacts. However, the reliance of GNN-based methods on substantial labeled data for graph construction and robust performance limits their applicability in scenarios with limited labeled data. Although vast amounts of audio data exist, the process of labeling samples as genuine or fake remains labor-intensive and costly. To address this challenge, we propose SIGNL (Spatio-temporal vIsion Graph Non-contrastive Learning), a novel framework that maintains high GNN performance in low-label settings. SIGNL constructs spatio-temporal graphs by representing patches from the audio's visual spectrogram as nodes. These graph structures are modeled using vision graph convolutional (GC) encoders pre-trained through graph non-contrastive learning, a label-free approach that maximizes the similarity between positive pairs. The pre-trained encoders are then fine-tuned for audio deepfake detection, reducing reliance on labeled data. Experiments demonstrate that SIGNL outperforms state-of-the-art baselines across multiple audio deepfake detection datasets, achieving the lowest Equal Error Rate (EER) with as little as 5% labeled data. Additionally, SIGNL exhibits strong cross-domain generalization, achieving the lowest EER in evaluations involving diverse attack types and languages in the In-The-Wild dataset.
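The graph construction step turns the audio's visual spectrogram into nodes and edges. The sketch below illustrates one plausible reading of that step: the spectrogram is split into patches, each patch becomes a node, and nodes are linked to their nearest neighbors in feature space, as in vision GNNs (ViG-style k-NN graphs). The patch size, k, and edge rule here are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch: spectrogram -> patch nodes + k-NN edge index.
# Patch size, k, and the k-NN edge rule are illustrative assumptions.
import torch

def spectrogram_to_graph(spec: torch.Tensor, patch: int = 16, k: int = 8):
    """spec: (freq, time) magnitude spectrogram -> node features, edge index."""
    F, T = spec.shape
    # Crop so the spectrogram tiles evenly into (patch x patch) blocks.
    spec = spec[: F - F % patch, : T - T % patch]
    nf, nt = spec.shape[0] // patch, spec.shape[1] // patch
    # Each patch becomes one node; its flattened pixels are the node feature.
    patches = spec.reshape(nf, patch, nt, patch).permute(0, 2, 1, 3)
    x = patches.reshape(nf * nt, patch * patch)   # (num_nodes, patch*patch)
    # Connect every node to its k most similar nodes in feature space;
    # temporal-adjacency edges could be added for the temporal dimension.
    d = torch.cdist(x, x)                         # pairwise distances
    d.fill_diagonal_(float("inf"))                # no self-loops
    nbrs = d.topk(k, largest=False).indices       # (num_nodes, k)
    src = torch.arange(x.size(0)).repeat_interleave(k)
    edge_index = torch.stack([src, nbrs.reshape(-1)])  # (2, num_edges)
    return x, edge_index
```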


Key findings
SIGNL outperforms state-of-the-art baselines across multiple datasets, achieving the lowest Equal Error Rate (EER) with as little as 5% labeled data. It also exhibits strong cross-domain generalization, achieving low EERs even with diverse attack types and languages in the In-The-Wild dataset. The combination of multiple graph augmentations further improves performance.
Approach
SIGNL constructs spatio-temporal graphs from patches of an audio spectrogram. Vision graph convolutional encoders are pre-trained using graph non-contrastive learning on unlabeled data, maximizing similarity between positive pairs. These pre-trained encoders are then fine-tuned for audio deepfake detection.
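To make the pre-training step concrete, here is a minimal sketch of positive-pairs-only (non-contrastive) pre-training in the BGRL/BYOL style the description suggests: an online encoder plus predictor learns to match an exponential-moving-average target encoder across two augmented views, so no negative samples are needed. The MLP encoder, feature-masking augmentation, and hyperparameters are stand-ins; SIGNL itself uses vision graph convolutional encoders and graph augmentations.

```python
# Sketch of non-contrastive pre-training (BGRL/BYOL-style); all modules and
# hyperparameters below are illustrative stand-ins, not the authors' code.
import copy
import torch
import torch.nn.functional as F

def non_contrastive_loss(online_z, target_z):
    # Maximize cosine similarity between positive pairs (same nodes, two views).
    return 2 - 2 * F.cosine_similarity(online_z, target_z, dim=-1).mean()

encoder = torch.nn.Sequential(torch.nn.Linear(256, 128), torch.nn.ReLU(),
                              torch.nn.Linear(128, 64))  # stand-in for vision GC encoder
predictor = torch.nn.Linear(64, 64)
target = copy.deepcopy(encoder)               # EMA target encoder, no gradients
for p in target.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()),
                       lr=1e-3)

x = torch.randn(100, 256)                     # unlabeled node features
for step in range(100):
    v1 = x * (torch.rand_like(x) > 0.2)       # illustrative feature-masking views
    v2 = x * (torch.rand_like(x) > 0.2)
    loss = (non_contrastive_loss(predictor(encoder(v1)), target(v2)) +
            non_contrastive_loss(predictor(encoder(v2)), target(v1))) / 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                     # EMA update of the target encoder
        for pt, po in zip(target.parameters(), encoder.parameters()):
            pt.mul_(0.99).add_(po, alpha=0.01)
```

After pre-training, the target/predictor are discarded and the encoder is fine-tuned with the small labeled set for the genuine-vs-fake decision.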
Datasets
ASVspoof 2021 DF, ASVspoof 5, Chinese Fake Audio Detection (CFAD), In-The-Wild
Model(s)
Vision Graph Convolutional Networks (GCNs), Wav2Vec2 (for feature extraction)
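The paper lists Wav2Vec2 as the feature extractor. A minimal sketch of that step using the HuggingFace transformers API follows; the specific checkpoint and pooling are assumptions, since the summary does not name them.

```python
# Hypothetical Wav2Vec2 feature-extraction step; the checkpoint name is an
# assumption, chosen as a common multilingual wav2vec2 model.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

name = "facebook/wav2vec2-xls-r-300m"
fe = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

waveform = torch.randn(16000 * 4)  # placeholder: 4 s of 16 kHz audio
inputs = fe(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, 1024)
print(hidden.shape)  # frame-level features from which the graph is built
```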
Author countries
Australia