Towards Attention-based Contrastive Learning for Audio Spoof Detection

Authors: Chirag Goel, Surya Koppisetti, Ben Colman, Ali Shahriyari, Gaurav Bharaj

Published: 2024-07-03 21:25:12+00:00

AI Summary

This paper introduces an attention-based contrastive learning framework (SSAST-CL) for audio spoof detection using Vision Transformers (ViTs). SSAST-CL improves upon a baseline ViT model by incorporating cross-attention to enhance representation learning, achieving competitive performance on the ASVSpoof 2021 challenge.

Abstract

Vision transformers (ViTs) have made substantial progress for classification tasks in computer vision. Recently, Gong et al. '21 introduced attention-based modeling for several audio tasks. However, the use of ViTs for the audio spoof detection task remains relatively unexplored. We bridge this gap and introduce ViTs for this task. A vanilla baseline built on fine-tuning the SSAST (Gong et al. '22) audio ViT model achieves sub-optimal equal error rates (EERs). To improve performance, we propose a novel attention-based contrastive learning framework (SSAST-CL) that uses cross-attention to aid the representation learning. Experiments show that our framework successfully disentangles the bonafide and spoof classes and helps learn better classifiers for the task. With an appropriate data augmentation policy, a model trained on our framework achieves competitive performance on the ASVSpoof 2021 challenge. We provide comparisons and ablation studies to justify our claims.
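
The abstract credits an appropriate data-augmentation policy for the competitive performance; the key findings below name RawBoost and FIR filtering. As a minimal sketch of one such augmentation on raw waveforms, the snippet below applies a randomly parameterized FIR band-pass filter; the cutoff ranges and tap count are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def random_fir_augment(wave, sr=16000, numtaps=31, rng=None):
    """Convolve a waveform with a random FIR band-pass filter.
    Cutoff ranges below are illustrative assumptions, not the paper's policy."""
    rng = rng or np.random.default_rng()
    low = rng.uniform(50, 400)              # Hz, random lower band edge
    high = rng.uniform(3000, sr / 2 - 100)  # Hz, random upper band edge
    taps = firwin(numtaps, [low, high], pass_zero=False, fs=sr)
    return lfilter(taps, [1.0], wave)

# usage: augment one second of (synthetic) 16 kHz audio
augmented = random_fir_augment(np.random.randn(16000))
```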


Key findings
The proposed SSAST-CL framework significantly outperforms a baseline ViT trained with cross-entropy loss, achieving a competitive equal error rate (EER) of 4.74% on the ASVSpoof 2021 challenge. Data augmentation, particularly RawBoost and FIR filtering, plays a crucial role in reaching this performance, and the inclusion of cross-attention further reduces the EER.
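
For context, the EER is the operating point where the false-acceptance and false-rejection rates are equal. A minimal sketch of the standard computation from detection scores (this is the generic metric, not code from the paper):

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    """labels: 1 = bonafide, 0 = spoof; scores: higher means more bonafide."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # operating point where FPR ~= FNR
    return (fpr[idx] + fnr[idx]) / 2.0     # reported as a fraction; x100 for percent
```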
Approach
The authors propose a two-stage approach. Stage I uses a Siamese network with a novel contrastive loss function that leverages both self-attention and cross-attention to learn discriminative representations of bonafide and spoof audio. Stage II trains an MLP classifier on these learned representations.
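
The summary does not give the exact attention wiring or loss of SSAST-CL, so the following is a minimal PyTorch sketch under stated assumptions: a shared encoder stands in for the SSAST backbone, nn.MultiheadAttention supplies the self- and cross-attention, and a standard margin-based pairwise contrastive loss stands in for the paper's novel loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionSiamese(nn.Module):
    """Siamese Stage I sketch: shared encoder plus self- and cross-attention.
    `encoder` is a stand-in for the pretrained SSAST backbone (an assumption)."""
    def __init__(self, encoder, dim=768, heads=8):
        super().__init__()
        self.encoder = encoder  # weights shared across both branches
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, spec_a, spec_b):
        tok_a, tok_b = self.encoder(spec_a), self.encoder(spec_b)
        # self-attention within each branch
        sa_a, _ = self.self_attn(tok_a, tok_a, tok_a)
        sa_b, _ = self.self_attn(tok_b, tok_b, tok_b)
        # cross-attention: each branch queries the other branch's tokens
        ca_a, _ = self.cross_attn(tok_a, tok_b, tok_b)
        ca_b, _ = self.cross_attn(tok_b, tok_a, tok_a)
        # mean-pool the combined tokens into one utterance embedding per branch
        z_a = F.normalize((sa_a + ca_a).mean(dim=1), dim=-1)
        z_b = F.normalize((sa_b + ca_b).mean(dim=1), dim=-1)
        return z_a, z_b

def pairwise_contrastive_loss(z_a, z_b, same_class, margin=0.5):
    """Pull same-class pairs together, push different-class pairs at least
    `margin` apart (a standard stand-in loss, not the paper's formulation)."""
    d = (z_a - z_b).pow(2).sum(dim=-1)
    return torch.where(same_class.bool(), d,
                       F.relu(margin - d.sqrt()).pow(2)).mean()

# usage with a toy encoder: 128 mel bins per frame -> 768-d tokens (assumed sizes)
enc = nn.Linear(128, 768)
model = CrossAttentionSiamese(enc)
spec_a, spec_b = torch.randn(4, 100, 128), torch.randn(4, 100, 128)
z_a, z_b = model(spec_a, spec_b)
loss = pairwise_contrastive_loss(z_a, z_b, same_class=torch.tensor([1, 0, 1, 0]))
```

The point the sketch mirrors is that each branch's tokens also attend to the other branch, so the pair is compared inside the representation itself rather than only in the loss.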
Datasets
ASVSpoof 2021 logical access (LA) challenge dataset (training set shared with ASVSpoof 2019), AudioSet, LibriSpeech
Model(s)
SSAST (Self-Supervised Audio Spectrogram Transformer) architecture adapted for contrastive learning (SSAST-CL), MLP classifier
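
The Stage II classifier operates on the frozen Stage I embeddings; a minimal sketch, where the hidden sizes are assumptions rather than the paper's configuration:

```python
import torch.nn as nn

# Stage II head: binary bonafide/spoof classifier on frozen 768-d embeddings.
# Hidden width and depth are illustrative assumptions.
mlp_classifier = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 2),  # logits for {bonafide, spoof}
)
```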
Author countries
USA