Frame-to-Utterance Convergence: A Spectra-Temporal Approach for Unified Spoofing Detection

Authors: Awais Khan, Khalid Mahmood Malik, Shah Nawaz

Published: 2023-09-18 14:54:42+00:00

AI Summary

This paper proposes a unified voice spoofing detection method based on a spectra-temporal fusion approach. It combines frame-level spectral deviation coefficients (SDC) with utterance-level sequential temporal coefficients (STC) via an autoencoder, producing robust spectra-temporal deviated coefficients (STDC) that enable detection of a wide range of spoofing attacks.

Abstract

Voice spoofing attacks pose a significant threat to automated speaker verification systems. Existing anti-spoofing methods often simulate specific attack types, such as synthetic or replay attacks. However, in real-world scenarios, the countermeasures are unaware of the generation schema of the attack, necessitating a unified solution. Current unified solutions struggle to detect spoofing artifacts, especially with recent spoofing mechanisms. For instance, the spoofing algorithms inject spectral or temporal anomalies, which are challenging to identify. To this end, we present a spectra-temporal fusion leveraging frame-level and utterance-level coefficients. We introduce a novel local spectral deviation coefficient (SDC) for frame-level inconsistencies and employ a bi-LSTM-based network for sequential temporal coefficients (STC), which capture utterance-level artifacts. Our spectra-temporal fusion strategy combines these coefficients, and an auto-encoder generates spectra-temporal deviated coefficients (STDC) to enhance robustness. Our proposed approach addresses multiple spoofing categories, including synthetic, replay, and partial deepfake attacks. Extensive evaluation on diverse datasets (ASVspoof2019, ASVspoof2021, VSDC, partial spoofs, and in-the-wild deepfakes) demonstrated its robustness for a wide range of voice applications.


Key findings
The proposed spectra-temporal fusion approach outperforms existing state-of-the-art methods in detecting diverse spoofing attacks, including logical access (synthetic), physical access (replay), and both full and partial deepfakes. Combining spectral and temporal features yields improved performance across all evaluated datasets, and the method generalizes well across attack types.
Approach
The approach extracts frame-level features using a novel local spectral deviation coefficient (SDC) and utterance-level features using a Bi-LSTM network to generate sequential temporal coefficients (STC). These are then fused using an autoencoder to produce spectra-temporal deviated coefficients (STDC) for robust spoofing detection.
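The frame-level half of this pipeline can be illustrated with a toy computation: each frame's magnitude spectrum is compared against the utterance-level mean spectrum, so frames carrying spoofing artifacts stand out as deviations. This is only a minimal sketch of the general idea; the function name, windowing, and parameters below are illustrative assumptions, not the paper's actual SDC formulation, and the Bi-LSTM and autoencoder stages are omitted.

```python
import numpy as np

def spectral_deviation_coefficients(signal, frame_len=256, hop=128):
    """Toy frame-level spectral deviation: each frame's magnitude
    spectrum minus the utterance-level mean spectrum. Illustrative
    only; not the paper's exact SDC definition."""
    # Slice the signal into overlapping frames.
    frames = np.stack([
        signal[i:i + frame_len]
        for i in range(0, len(signal) - frame_len + 1, hop)
    ])
    # Windowed magnitude spectrum per frame.
    spectra = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    # Deviation of each frame from the utterance-level mean spectrum.
    return spectra - spectra.mean(axis=0, keepdims=True)

# Usage: a one-second 440 Hz tone at 16 kHz as a stand-in utterance.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
sdc = spectral_deviation_coefficients(sig)
```

In the paper's full pipeline, these frame-level deviations would be fused with utterance-level temporal coefficients from a Bi-LSTM, and an autoencoder would compress the concatenated features into the final STDC representation.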
Datasets
ASVspoof2019, ASVspoof2021, VSDC, partial spoofs (Utterance-based), and in-the-wild audio deepfakes (IWA)
Model(s)
Bi-LSTM, Autoencoder, Ensemble, SE-ResNext18, ResNet18, Random Forest, KNN, SVM, Logistic Regression, Naive Bayes, Decision Tree, ResNext18
Author countries
USA, Belgium