Securing Voice Biometrics: One-Shot Learning Approach for Audio Deepfake Detection

Authors: Awais Khan, Khalid Mahmood Malik

Published: 2023-10-05 19:30:22+00:00

AI Summary

This paper introduces Quick-SpoofNet, a one-shot and metric learning approach for detecting both seen and unseen audio deepfake attacks in Automatic Speaker Verification (ASV) systems. It extracts compact temporal embeddings from spectral features of voice samples and is trained with triplet loss, so that bona fide speech can be separated from spoofing attacks by embedding similarity. The system is trained and evaluated on ASVspoof 2019 LA, then tested for generalization against unseen deepfakes from ASVspoof 2021 and unseen bona fide speech from VSDC.

Abstract

The Automatic Speaker Verification (ASV) system is vulnerable to fraudulent activities using audio deepfakes, also known as logical-access voice spoofing attacks. These deepfakes pose a concerning threat to voice biometrics due to recent advancements in generative AI and speech synthesis technologies. While several deep learning models for speech synthesis detection have been developed, most of them show poor generalizability, especially when the attacks have different statistical distributions from the ones seen. Therefore, this paper presents Quick-SpoofNet, an approach for detecting both seen and unseen synthetic attacks in the ASV system using one-shot learning and metric learning techniques. By using the effective spectral feature set, the proposed method extracts compact and representative temporal embeddings from the voice samples and utilizes metric learning and triplet loss to assess the similarity index and distinguish different embeddings. The system effectively clusters similar speech embeddings, classifying bona fide speeches as the target class and identifying other clusters as spoofing attacks. The proposed system is evaluated using the ASVspoof 2019 logical access (LA) dataset and tested against unseen deepfake attacks from the ASVspoof 2021 dataset. Additionally, its generalization ability towards unseen bona fide speech is assessed using speech data from the VSDC dataset.


Key findings
Quick-SpoofNet achieved an Equal Error Rate (EER) of 0.50% and 98.5% accuracy on the ASVspoof 2019 LA dataset for seen attacks, outperforming state-of-the-art feature-fusion methods. It generalized to unseen spoofing attacks from ASVspoof 2021 with 86.41% accuracy (bona fide samples drawn from ASVspoof 2019) and to unseen bona fide speakers with 98.50% accuracy (spoofing samples drawn from ASVspoof 2019). Accuracy dropped to 77.63% when both the genuine (VSDC) and the spoofing (ASVspoof 2021) samples were unseen.
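For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoofed samples accepted) equals the false rejection rate (bona fide samples rejected). The following is a minimal sketch of how it can be approximated from detection scores. The function name, the score convention (higher means more likely bona fide), and the toy scores are assumptions for illustration and are not taken from the paper.

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumes higher scores mean 'more likely bona fide' (illustrative convention).
    """
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: bona fide samples scored below the threshold.
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        # False acceptance rate: spoofed samples scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical): one bona fide and one spoofed sample overlap.
bona_fide = [0.9, 0.8, 0.7, 0.4]
spoofed = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(bona_fide, spoofed))  # 0.25
```

With finite score sets the two error rates rarely cross exactly, which is why this sketch averages them at the closest point. Evaluation toolkits typically interpolate instead.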
Approach
Quick-SpoofNet combines one-shot learning and metric learning within a Siamese network architecture. It extracts spectral features (log Mel-spectrogram, spectral envelope, spectral contrast) from audio samples and feeds them into shared LSTM layers to generate temporal embeddings. The network is trained with triplet loss, so that anchor-positive embeddings lie closer together than anchor-negative embeddings, and the Euclidean distance between embeddings is used to classify query samples as bona fide or spoofed.
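The two distance-based steps can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: it uses one common form of triplet loss, max(d(a, p) - d(a, n) + margin, 0), and a simple thresholded distance to a single bona fide reference embedding. The margin, the threshold, the function names, and the two-dimensional toy embeddings are all assumptions; the summary does not state the paper's exact decision rule.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """One common triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


def classify(query, bona_fide_reference, threshold):
    """One-shot style decision (assumed rule): compare the query with a
    single bona fide reference embedding and threshold the distance."""
    distance = euclidean(query, bona_fide_reference)
    return "bona fide" if distance <= threshold else "spoof"


# Toy embeddings (hypothetical values).
anchor, positive = [0.0, 0.0], [0.0, 1.0]
print(triplet_loss(anchor, positive, [3.0, 4.0]))  # 0.0, negative is far enough away
print(triplet_loss(anchor, positive, [0.0, 1.5]))  # 0.5, negative is still too close
print(classify([0.0, 0.5], anchor, threshold=1.0))  # bona fide
print(classify([3.0, 4.0], anchor, threshold=1.0))  # spoof
```

In a trained system the embeddings would come from the shared LSTM encoder, and the threshold would be tuned on held-out data rather than fixed by hand.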
Datasets
ASVspoof 2019 logical access (LA) dataset, ASVspoof 2021-DF dataset, VSDC-0PR dataset
Model(s)
Quick-SpoofNet: a Siamese network with shared weights, built from two LSTM layers with 64 nodes followed by dense layers (512, 256, 128) and trained with a triplet loss function.
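To give a rough sense of the encoder's size, the sketch below counts trainable parameters for the layer widths listed above. Several details are assumptions, since the summary does not specify them: that each of the two LSTM layers has 64 units, that each LSTM gate uses a single bias vector, that only the final 64-dimensional LSTM output feeds the dense stack, and that the per-frame feature dimension is the hypothetical value passed in by the caller.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and one bias
    vector (single-bias convention, assumed)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus bias vector of a fully connected layer."""
    return input_dim * units + units


def count_parameters(feature_dim):
    """Parameter count of one shared branch under the stated assumptions."""
    total = lstm_params(feature_dim, 64)  # first LSTM layer
    total += lstm_params(64, 64)          # second LSTM layer
    previous = 64
    for width in (512, 256, 128):         # dense stack from the summary
        total += dense_params(previous, width)
        previous = width
    return total


# 128 features per frame is a hypothetical input size, not a value from the paper.
print(count_parameters(128))  # 279936
```

Because the Siamese branches share weights, this count applies once to the whole network regardless of how many inputs (anchor, positive, negative) are embedded per training step.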
Author countries
USA