Bridging the Spoof Gap: A Unified Parallel Aggregation Network for Voice Presentation Attacks

Authors: Awais Khan, Khalid Mahmood Malik

Published: 2023-09-19 12:12:59+00:00

AI Summary

This paper introduces a Parallel Stacked Aggregation Network to bridge the gap in unified spoofing detection for Automatic Speaker Verification (ASV) systems, which are vulnerable to both logical (LA) and physical (PA) attacks. The proposed approach directly processes raw audio using a split-transform-aggregation technique to identify spoofing attacks. It significantly outperforms state-of-the-art solutions on ASVspoof-2019 and VSDC datasets, showing reduced Equal Error Rate (EER) disparities and superior generalizability across attack types.

Abstract

Automatic Speaker Verification (ASV) systems are increasingly used in voice biometrics for user authentication but are susceptible to logical and physical spoofing attacks, posing security risks. Existing research mainly tackles logical or physical attacks separately, leading to a gap in unified spoofing detection. Moreover, when existing systems attempt to handle both types of attacks, they often exhibit significant disparities in the Equal Error Rate (EER). To bridge this gap, we present a Parallel Stacked Aggregation Network that processes raw audio. Our approach employs a split-transform-aggregation technique, dividing utterances into convolved representations, applying transformations, and aggregating the results to identify logical (LA) and physical (PA) spoofing attacks. Evaluation on the ASVspoof-2019 and VSDC datasets shows the effectiveness of the proposed system. It outperforms state-of-the-art solutions, displaying reduced EER disparities and superior performance in detecting spoofing attacks. This highlights the proposed method's generalizability and superiority. In a world increasingly reliant on voice-based security, our unified spoofing detection system provides a robust defense against a spectrum of voice spoofing attacks, safeguarding ASV systems and user data effectively.


Key findings
On ASVspoof-2019 (with augmentation), the proposed system achieved an EER of 3.04% and a min t-DCF of 0.087 for LA attacks, and an EER of 1.26% and a min t-DCF of 0.038 for PA attacks, outperforming twelve individual and seven unified state-of-the-art solutions. It reduced the EER disparity between LA and PA attacks while detecting a range of spoofing attacks. The network performed better on raw waveforms than on traditional handcrafted features.
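Since the findings above are reported as EER values, a short illustration of how that metric is defined may help. The following is a minimal sketch, not the authors' evaluation code: the function name `compute_eer`, the score convention (higher means more likely bona fide), and the toy scores are all assumptions made for illustration. It sweeps every observed score as a decision threshold and reports the point where the false acceptance and false rejection rates are closest.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold); higher scores are assumed to mean 'bona fide'."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # The EER is approximated by the mean of FAR and FRR at the closest point.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical): one bona fide trial scores low, one spoof scores high.
eer, thr = compute_eer([0.9, 0.8, 0.7, 0.35], [0.1, 0.2, 0.3, 0.6])
print(f"EER = {eer:.2%} at threshold {thr}")  # EER = 25.00% at threshold 0.6
```

A lower EER means the two error rates balance at a smaller value, so a small gap between the LA and PA figures indicates that a single model handles both attack types comparably well.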
Approach
The paper introduces an SE-Parallel Stacked Aggregation (SE-PSA) Network that processes raw audio directly, eliminating the need for spectrograms or handcrafted features. It employs a split-transform-aggregation technique, drawing inspiration from Inception networks and ResNeXt, combined with Squeeze-and-Excitation (SE) blocks and spatial dropout, to extract multi-level acoustic cues for classifying logical and physical spoofing attacks.
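The split-transform-aggregation idea with SE gating can be sketched in a few lines. What follows is a toy illustration in plain Python under stated assumptions, not the authors' architecture: the function names (`toy_block` and its helpers), the channel count, bottleneck width, number of branches, kernel sizes, and random weights are all hypothetical, and spatial dropout is omitted because the sketch only runs a forward pass. The input `x` stands in for a feature map that an earlier convolution would have produced from the raw waveform.

```python
import math
import random


def conv1d(x, weights):
    """Zero-padded 'same' 1D convolution. x: [C_in][T]; weights: [C_out][C_in][K]."""
    pad = len(weights[0][0]) // 2
    t_len = len(x[0])
    out = []
    for w_out in weights:
        row = []
        for t in range(t_len):
            acc = 0.0
            for c, w_in in enumerate(w_out):
                for j, w in enumerate(w_in):
                    idx = t + j - pad
                    if 0 <= idx < t_len:
                        acc += w * x[c][idx]
            row.append(acc)
        out.append(row)
    return out


def relu(x):
    return [[max(0.0, v) for v in ch] for ch in x]


def rand_weights(rng, c_out, c_in, k):
    return [[[rng.uniform(-0.5, 0.5) for _ in range(k)]
             for _ in range(c_in)] for _ in range(c_out)]


def split_transform_aggregate(x, branches):
    """Run every branch on the same input and sum the branch outputs."""
    total = None
    for w_reduce, w_transform, w_expand in branches:
        h = relu(conv1d(x, w_reduce))      # split: project to a narrow embedding
        h = relu(conv1d(h, w_transform))   # transform: convolve within the branch
        h = conv1d(h, w_expand)            # expand back to the input width
        if total is None:
            total = h
        else:                              # aggregate: element-wise sum
            total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, h)]
    return total


def squeeze_excite(x, w1, w2):
    """Channel gating. w1: [C_reduced][C]; w2: [C][C_reduced]."""
    squeezed = [sum(ch) / len(ch) for ch in x]  # squeeze: average over time
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]                     # excite: one sigmoid gate per channel
    return [[g * v for v in ch] for g, ch in zip(gates, x)], gates


def toy_block(x, branches, w1, w2):
    """Aggregated branches, then SE gating, then a residual connection."""
    y, _ = squeeze_excite(split_transform_aggregate(x, branches), w1, w2)
    return relu([[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)])


rng = random.Random(0)
channels, bottleneck, cardinality = 4, 2, 3   # hypothetical sizes
branches = [(rand_weights(rng, bottleneck, channels, 1),
             rand_weights(rng, bottleneck, bottleneck, 3),
             rand_weights(rng, channels, bottleneck, 1))
            for _ in range(cardinality)]
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(channels)] for _ in range(2)]
w2 = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(channels)]
x = [[rng.uniform(-1.0, 1.0) for _ in range(16)] for _ in range(channels)]
out = toy_block(x, branches, w1, w2)
print(len(out), len(out[0]))  # channel count and length match the input
```

Each branch sees the full input and contributes an independent transformation, so the aggregation is a plain sum; the SE step then rescales whole channels with gates between 0 and 1 before the residual addition. A practical model would use a deep learning framework with learned weights rather than nested lists.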
Datasets
ASVspoof-2019, VSDC (Voice Spoofing Detection Corpus)
Model(s)
SE-Parallel Stacked Aggregation (SE-PSA) Network, which integrates concepts from ResNeXt (intra-architecture), Inception networks (split-transform-merge), VGG-Net, ResNets (repeating residual layers), and Squeeze and Excitation (SE) blocks.
Author countries
USA