Securing Voice Biometrics: One-Shot Learning Approach for Audio Deepfake Detection

Authors: Awais Khan, Khalid Mahmood Malik

Published: 2023-10-05 19:30:22+00:00

AI Summary

This paper introduces Quick-SpoofNet, a one-shot learning and metric learning approach for audio deepfake detection. It uses a robust spectral feature set and a Siamese LSTM network to generate temporal embeddings, effectively classifying bona fide and spoofed speech even for unseen attacks.

Abstract

The Automatic Speaker Verification (ASV) system is vulnerable to fraudulent activities using audio deepfakes, also known as logical-access voice spoofing attacks. These deepfakes pose a concerning threat to voice biometrics due to recent advancements in generative AI and speech synthesis technologies. While several deep learning models for speech synthesis detection have been developed, most of them generalize poorly, especially when the attacks have statistical distributions different from those seen during training. Therefore, this paper presents Quick-SpoofNet, an approach for detecting both seen and unseen synthetic attacks in ASV systems using one-shot learning and metric learning techniques. Using an effective spectral feature set, the proposed method extracts compact and representative temporal embeddings from voice samples and applies metric learning with a triplet loss to assess similarity and separate the embeddings of different classes. The system clusters similar speech embeddings, classifying bona fide speech as the target class and identifying other clusters as spoofing attacks. The proposed system is evaluated on the ASVspoof 2019 logical access (LA) dataset and tested against unseen deepfake attacks from the ASVspoof 2021 dataset. Additionally, its generalization to unseen bona fide speech is assessed using speech data from the VSDC dataset.
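
For context, the triplet loss named in the abstract pulls an anchor embedding toward a positive example of the same class and pushes it away from a negative example of a different class. A standard formulation (a sketch; the paper's exact variant and margin value are not given in this summary) is:

```latex
\mathcal{L}(a, p, n) = \max\!\left(0,\ \lVert f(a) - f(p) \rVert_2 - \lVert f(a) - f(n) \rVert_2 + m\right)
```

where f is the shared embedding network, (a, p, n) is an (anchor, positive, negative) triplet, and m is the margin.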


Key findings

Quick-SpoofNet achieved an equal error rate (EER) of 0.50% on seen attacks from the ASVspoof 2019 LA dataset and generalized well to unseen attacks from the ASVspoof 2021 and VSDC datasets. Performance decreased slightly when the system was tested on unseen bona fide speech from a different dataset, but it still significantly outperformed other state-of-the-art methods.
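
For reference, the EER is the operating point where the false acceptance rate (spoofs accepted as bona fide) equals the false rejection rate (bona fide speech rejected). A minimal sketch of computing it from similarity scores, assuming higher scores indicate bona fide speech (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER from similarity scores.

    scores : array of similarity scores (higher = more bona fide)
    labels : array of binary labels (1 = bona fide, 0 = spoof)
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    bona, spoof = scores[labels == 1], scores[labels == 0]

    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):           # candidate thresholds
        far = np.mean(spoof >= t)         # false acceptance rate
        frr = np.mean(bona < t)           # false rejection rate
        if abs(far - frr) < best_gap:     # closest FAR/FRR crossing
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```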

Approach

Quick-SpoofNet uses one-shot learning and metric learning to detect audio deepfakes. It extracts spectral features (Mel-spectrogram, spectral envelope, and spectral contrast) and feeds them to a Siamese LSTM network trained with a triplet loss to generate temporal embeddings. The similarity between embeddings is then used to classify audio samples as bona fide or spoofed; a front-end sketch follows.
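
A minimal sketch of the front end under stated assumptions: the three named spectral cues computed with librosa and stacked into one frame-level feature matrix. All parameter values (sample rate, FFT size, Mel bands, cepstral lifter) are illustrative guesses, and the spectral-envelope estimator used here (cepstral smoothing) is one common choice, not necessarily the authors':

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, n_fft=512, hop=256, n_mels=80):
    """Build a (time, features) matrix from the three spectral cues
    named in the paper. Parameter values are illustrative."""
    y, sr = librosa.load(path, sr=sr)

    # Log Mel-spectrogram: shape (n_mels, T)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)

    # Spectral contrast: peak/valley differences per sub-band, (7, T)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr, n_fft=n_fft,
                                                 hop_length=hop)

    # Spectral envelope, approximated by cepstrally smoothing the
    # log-magnitude spectrum (the paper's exact estimator is not
    # specified in this summary).
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    ceps = np.fft.irfft(np.log(spec + 1e-8), axis=0)
    lifter = 20
    ceps[lifter:-lifter] = 0.0            # keep low quefrencies only
    envelope = np.fft.rfft(ceps, axis=0).real

    # Stack into one frame-level feature matrix, shape (T, D)
    feats = np.vstack([log_mel, contrast, envelope]).T
    return feats.astype(np.float32)
```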

Datasets

ASVspoof 2019 logical access (LA) dataset, ASVspoof 2021 dataset, VSDC dataset

Model(s)

Siamese LSTM network with triplet loss
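
A minimal sketch of a Siamese LSTM encoder trained with a triplet loss, in the spirit of the model named above. Layer counts, hidden sizes, and the margin are assumptions for illustration (feat_dim matches the illustrative front end above); the paper's configuration is not given in this summary:

```python
import torch
import torch.nn as nn

class EmbeddingLSTM(nn.Module):
    """Shared LSTM encoder: maps a (T, D) feature sequence to a
    fixed-size, L2-normalized embedding. Sizes are illustrative."""
    def __init__(self, feat_dim=344, hidden=128, emb_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):                 # x: (B, T, D)
        _, (h, _) = self.lstm(x)          # h: (layers, B, hidden)
        emb = self.proj(h[-1])            # last layer's final state
        return nn.functional.normalize(emb, dim=-1)

encoder = EmbeddingLSTM()
triplet = nn.TripletMarginLoss(margin=0.5)  # margin is an assumption

# One training step on a dummy (anchor, positive, negative) triplet:
# anchor/positive are bona fide samples, negative is a spoofed sample.
a = encoder(torch.randn(8, 100, 344))
p = encoder(torch.randn(8, 100, 344))
n = encoder(torch.randn(8, 100, 344))
loss = triplet(a, p, n)
loss.backward()
```

At inference time, a test embedding would be compared against reference bona fide embeddings; samples falling outside the bona fide cluster are flagged as spoofing attacks.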

Author countries

USA