Frame-to-Utterance Convergence: A Spectra-Temporal Approach for Unified Spoofing Detection

Authors: Awais Khan, Khalid Mahmood Malik, Shah Nawaz

Published: 2023-09-18 14:54:42+00:00

AI Summary

This paper introduces a unified spectra-temporal approach for detecting several classes of voice spoofing attack, including synthetic, replay, and partial deepfake attacks. The method combines frame-level spectral deviation coefficients (SDC) with utterance-level sequential temporal coefficients (STC), the latter produced by a bi-LSTM network. The two sets of coefficients are fused and passed through an auto-encoder to generate spectra-temporal deviated coefficients (STDC), which the authors report improve detection across these spoofing categories.

Abstract

Voice spoofing attacks pose a significant threat to automated speaker verification systems. Existing anti-spoofing methods often simulate specific attack types, such as synthetic or replay attacks. However, in real-world scenarios, the countermeasures are unaware of the generation schema of the attack, necessitating a unified solution. Current unified solutions struggle to detect spoofing artifacts, especially with recent spoofing mechanisms. For instance, the spoofing algorithms inject spectral or temporal anomalies, which are challenging to identify. To this end, we present a spectra-temporal fusion leveraging frame-level and utterance-level coefficients. We introduce a novel local spectral deviation coefficient (SDC) for frame-level inconsistencies and employ a bi-LSTM-based network for sequential temporal coefficients (STC), which capture utterance-level artifacts. Our spectra-temporal fusion strategy combines these coefficients, and an auto-encoder generates spectra-temporal deviated coefficients (STDC) to enhance robustness. Our proposed approach addresses multiple spoofing categories, including synthetic, replay, and partial deepfake attacks. Extensive evaluation on diverse datasets (ASVspoof2019, ASVspoof2021, VSDC, partial spoofs, and in-the-wild deepfakes) demonstrated its robustness for a wide range of voice applications.


Key findings
The proposed STDC features, particularly when paired with an SE-ResNeXt18 classifier, achieved lower equal error rates (EERs) than existing methods across logical access (LA), physical access (PA), and full and partial deepfake attacks. The approach generalized across these categories, with the largest gains reported when spectral and temporal coefficients are combined.
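Since the findings are reported as EERs, a minimal sketch of how an equal error rate can be approximated from detector scores may help. It assumes that a higher score means "more likely bona fide" and sweeps thresholds over the observed scores; the function name and the toy scores are illustrative and do not come from the paper, whose evaluation tooling may differ.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumed convention: higher score = more likely bona fide.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a bona fide utterance scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a spoofed utterance scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up for illustration only).
bonafide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
eer = equal_error_rate(bonafide, spoof)
print(f"EER = {eer:.2%}")  # prints "EER = 25.00%"
```

A lower EER means the false-acceptance and false-rejection rates meet at a smaller value, which is why the summary treats lower EERs as better performance.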
Approach
The method extracts frame-level spectral deviation coefficients (SDC) from log-Mel spectrograms using a Local Deviated Pattern (LDP) operator. Utterance-level sequential temporal coefficients (STC) are produced by a two-layer bidirectional long short-term memory (Bi-LSTM) network. The SDC and STC features are normalized, fused, and passed through an auto-encoder-decoder network to generate spectra-temporal deviated coefficients (STDC), which are then used for classification.
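The extract, normalize, and fuse steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the summary does not define the LDP operator, so an LBP-style 3x3 neighbourhood code stands in for it; the Bi-LSTM output is replaced by a random placeholder vector; fusion is assumed to be min-max normalization followed by concatenation; and the auto-encoder stage is omitted. All function names are hypothetical.

```python
import numpy as np


def local_pattern_codes(log_mel):
    """LBP-style 8-bit code per interior bin (assumed stand-in for the LDP operator)."""
    rows, cols = log_mel.shape
    center = log_mel[1:-1, 1:-1]
    codes = np.zeros(center.shape, dtype=np.int64)
    # Eight neighbours, clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dr, dc) in enumerate(offsets):
        neighbour = log_mel[1 + dr : rows - 1 + dr, 1 + dc : cols - 1 + dc]
        # Set this bit wherever the neighbour is at least as large as the centre.
        codes |= (neighbour >= center).astype(np.int64) << bit
    return codes


def frame_histograms(codes, n_bins=256):
    """One normalized code histogram per frame (one column of `codes` per frame)."""
    hists = np.stack([np.bincount(col, minlength=n_bins) for col in codes.T])
    return hists / hists.sum(axis=1, keepdims=True)


def min_max_normalize(x):
    """Scale a vector to [0, 1]; a constant vector maps to all zeros."""
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x, dtype=float)
    return (x - x.min()) / span


def fuse(sdc_like, stc_like):
    """Normalize each feature vector, then concatenate (assumed fusion rule)."""
    return np.concatenate([min_max_normalize(sdc_like), min_max_normalize(stc_like)])


rng = np.random.default_rng(0)
log_mel = rng.normal(size=(40, 100))   # placeholder for a real log-Mel spectrogram (mels x frames)
codes = local_pattern_codes(log_mel)
# Pooled over frames only to keep the example small.
sdc_like = frame_histograms(codes).mean(axis=0)
stc_like = rng.normal(size=64)         # placeholder for a Bi-LSTM utterance-level embedding
fused = fuse(sdc_like, stc_like)
print(codes.shape, fused.shape)
```

In the described method, the fused vector would next pass through the auto-encoder-decoder network to yield the STDC features; that stage and the backend classifier are left out here because the summary gives no architectural details for them.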
Datasets
ASVspoof2019, ASVspoof2021, VSDC, partial spoofs, in-the-wild audio deepfakes (IWA)
Model(s)
Bi-LSTM (for STC), Auto-encoder-decoder (for STDC generation), SE-ResNeXt18 (backend classifier; others evaluated include Random Forest, KNN, SVM, Logistic Regression, Naive Bayes, Decision Tree, Ensemble, ResNet18, SE-ResNet18, ResNeXt18)
Author countries
USA, Belgium