Bridging the Spoof Gap: A Unified Parallel Aggregation Network for Voice Presentation Attacks

Authors: Awais Khan, Khalid Mahmood Malik

Published: 2023-09-19 12:12:59+00:00

AI Summary

This paper proposes a Parallel Stacked Aggregation Network (PSA) for unified voice spoofing detection, addressing the gap in existing research that tackles logical and physical attacks separately. The PSA network processes raw audio using a split-transform-aggregation technique to identify both logical and physical attacks, outperforming state-of-the-art solutions with reduced Equal Error Rate (EER) disparities.

Abstract

Automatic Speaker Verification (ASV) systems are increasingly used in voice biometrics for user authentication but are susceptible to logical and physical spoofing attacks, posing security risks. Existing research mainly tackles logical or physical attacks separately, leading to a gap in unified spoofing detection. Moreover, when existing systems attempt to handle both types of attacks, they often exhibit significant disparities in the Equal Error Rate (EER). To bridge this gap, we present a Parallel Stacked Aggregation Network that processes raw audio. Our approach employs a split-transform-aggregation technique, dividing utterances into convolved representations, applying transformations, and aggregating the results to identify logical (LA) and physical (PA) spoofing attacks. Evaluation on the ASVspoof-2019 and VSDC datasets shows the effectiveness of the proposed system. It outperforms state-of-the-art solutions, displaying reduced EER disparities and superior performance in detecting spoofing attacks. This highlights the proposed method's generalizability and superiority. In a world increasingly reliant on voice-based security, our unified spoofing detection system provides a robust defense against a spectrum of voice spoofing attacks, safeguarding ASV systems and user data effectively.

Key findings

The proposed PSA network outperforms state-of-the-art methods in detecting both logical and physical voice spoofing attacks, and it narrows the gap between the EERs obtained on the two attack types. The model performs better on raw audio than on handcrafted features. The study also finds that data augmentation improves the model's generalization.
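Since the findings are stated in terms of EER, a minimal sketch of how an EER can be computed from detector scores may help. It assumes that a higher score means "more likely bona fide" and sweeps only the observed scores as thresholds, without interpolation; the function name and the toy scores are hypothetical and are not taken from the paper.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for a detector where higher = more bona fide.

    Assumption: the EER is approximated at the observed threshold where the
    false acceptance and false rejection rates are closest to each other.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False acceptance rate: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (made up for illustration only).
bonafide = [0.9, 0.8, 0.7, 0.35]
spoof = [0.1, 0.2, 0.3, 0.75]
eer, threshold = equal_error_rate(bonafide, spoof)
print(eer, threshold)
```

Under this convention, the EER disparity of a unified detector would simply be the absolute difference between the EER measured on LA trials and the EER measured on PA trials.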

Approach

The authors address unified voice spoofing detection by designing a Parallel Stacked Aggregation Network (PSA). The network operates on raw audio waveforms and uses a split-transform-aggregation technique: it divides each utterance into convolved representations, applies transformations to them, and aggregates the results to classify logical and physical spoofing attacks. The PSA network also incorporates squeeze-and-excitation (SE) blocks to enhance feature extraction.
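The general idea of split-transform-aggregation followed by squeeze-and-excitation can be sketched in NumPy as below. This is an illustration under stated assumptions, not the authors' architecture: the channel counts, kernel sizes, number of branches, reduction ratio, the residual connection, and all function names are hypothetical choices made for the example.

```python
import numpy as np


def conv1d_same(x, w):
    """1-D cross-correlation with zero padding; x: (C_in, T), w: (C_out, C_in, K), K odd."""
    k = w.shape[2]
    pad = k // 2
    t = x.shape[1]
    xp = np.pad(x, ((0, 0), (pad, pad)), mode="constant")
    # Sum the contribution of each kernel tap; each term has shape (C_out, T).
    return sum(w[:, :, j] @ xp[:, j:j + t] for j in range(k))


def squeeze_excite(x, w1, w2):
    """Squeeze-and-excitation: rescale each channel by a learned gate in (0, 1)."""
    z = x.mean(axis=1)                       # squeeze: average over time, (C,)
    h = np.maximum(w1 @ z, 0.0)              # bottleneck with ReLU, (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))      # sigmoid gate per channel, (C,)
    return x * s[:, None]                    # excite: channel-wise rescaling


def split_transform_aggregate(x, branches, se_weights):
    """Run parallel branches on x, sum them, apply SE, add a residual (assumed)."""
    agg = np.zeros_like(x)
    for w_reduce, w_conv, w_expand in branches:
        h = np.maximum(conv1d_same(x, w_reduce), 0.0)  # split: low-dim projection
        h = np.maximum(conv1d_same(h, w_conv), 0.0)    # transform: temporal conv
        agg += conv1d_same(h, w_expand)                # aggregate: sum over branches
    agg = squeeze_excite(agg, *se_weights)
    return np.maximum(x + agg, 0.0)


# Hypothetical sizes: 8 channels, 64 time steps, 4 branches of width 2, SE ratio 4.
rng = np.random.default_rng(0)
C, T, D, R, CARD = 8, 64, 2, 4, 4
x = rng.standard_normal((C, T))  # stands in for features from a first conv over raw audio
branches = [
    (0.1 * rng.standard_normal((D, C, 1)),
     0.1 * rng.standard_normal((D, D, 3)),
     0.1 * rng.standard_normal((C, D, 1)))
    for _ in range(CARD)
]
se_weights = (0.1 * rng.standard_normal((C // R, C)),
              0.1 * rng.standard_normal((C, C // R)))
y = split_transform_aggregate(x, branches, se_weights)
print(y.shape)  # (8, 64)
```

Summing the branch outputs keeps the output shape equal to the input shape, which is what allows such blocks to be stacked; the weights here are random, so the output carries no meaning beyond demonstrating the data flow.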

Datasets

ASVspoof-2019 and VSDC datasets

Model(s)

Parallel Stacked Aggregation Network (PSA) with Squeeze and Excitation (SE) blocks; ResNet architectures with varying layer depths were also explored for comparison.

Author countries

USA
