Frame-to-Utterance Convergence: A Spectra-Temporal Approach for Unified Spoofing Detection
Authors: Awais Khan, Khalid Mahmood Malik, Shah Nawaz
Published: 2023-09-18 14:54:42+00:00
AI Summary
This paper introduces a unified spectra-temporal approach for detecting several kinds of voice spoofing attacks, including synthetic, replay, and partial deepfake attacks. The method combines frame-level spectral deviation coefficients (SDC) with utterance-level sequential temporal coefficients (STC), the latter extracted by a bi-LSTM-based network. The two sets of coefficients are fused and passed through an auto-encoder to generate spectra-temporal deviated coefficients (STDC), which the authors report to be robust across these spoofing categories on ASVspoof2019, ASVspoof2021, VSDC, partial-spoof, and in-the-wild deepfake data.
Abstract
Voice spoofing attacks pose a significant threat to automated speaker verification systems. Existing anti-spoofing methods often simulate specific attack types, such as synthetic or replay attacks. However, in real-world scenarios, the countermeasures are unaware of the generation schema of the attack, necessitating a unified solution. Current unified solutions struggle to detect spoofing artifacts, especially with recent spoofing mechanisms. For instance, the spoofing algorithms inject spectral or temporal anomalies, which are challenging to identify. To this end, we present a spectra-temporal fusion leveraging frame-level and utterance-level coefficients. We introduce a novel local spectral deviation coefficient (SDC) for frame-level inconsistencies and employ a bi-LSTM-based network for sequential temporal coefficients (STC), which capture utterance-level artifacts. Our spectra-temporal fusion strategy combines these coefficients, and an auto-encoder generates spectra-temporal deviated coefficients (STDC) to enhance robustness. Our proposed approach addresses multiple spoofing categories, including synthetic, replay, and partial deepfake attacks. Extensive evaluation on diverse datasets (ASVspoof2019, ASVspoof2021, VSDC, partial spoofs, and in-the-wild deepfakes) demonstrated its robustness for a wide range of voice applications.
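The data flow described in the abstract (frame-level features, utterance-level features, fusion, auto-encoder) can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration and not the authors' method: the deviation-from-mean feature stands in for SDC, a plain bidirectional tanh recurrence stands in for the bi-LSTM that produces STC, the fusion is a simple concatenation, and only the encoder half of an auto-encoder is shown, with random untrained weights. All function names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_deviation(spec):
    """Frame-level features (hypothetical stand-in for SDC):
    deviation of each frame from the utterance's mean spectrum."""
    return spec - spec.mean(axis=0, keepdims=True)

def bidirectional_summary(frames, w_in, w_rec):
    """Utterance-level features (hypothetical stand-in for STC):
    final states of a forward and a backward tanh recurrence.
    A real system would use a trained bi-LSTM here."""
    def run(seq):
        h = np.zeros(w_rec.shape[0])
        for x in seq:
            h = np.tanh(x @ w_in + h @ w_rec)
        return h
    return np.concatenate([run(frames), run(frames[::-1])])

def fuse_and_encode(frame_feats, utt_feats, w_enc):
    """Fusion by concatenation, then the encoder half of an
    auto-encoder (assumed single tanh layer) gives a compact code."""
    pooled = np.abs(frame_feats).mean(axis=0)  # mean absolute deviation per bin
    fused = np.concatenate([pooled, utt_feats])
    return np.tanh(fused @ w_enc)

# Toy input: 50 frames x 20 spectral bins (random, not real speech).
n_frames, n_bins, hidden, code_dim = 50, 20, 8, 12
spec = np.abs(rng.standard_normal((n_frames, n_bins)))

# Random, untrained weights; a real model would learn these.
w_in = 0.1 * rng.standard_normal((n_bins, hidden))
w_rec = 0.1 * rng.standard_normal((hidden, hidden))
w_enc = 0.1 * rng.standard_normal((n_bins + 2 * hidden, code_dim))

dev = frame_deviation(spec)                    # shape (50, 20)
utt = bidirectional_summary(dev, w_in, w_rec)  # shape (16,)
stdc = fuse_and_encode(dev, utt, w_enc)        # shape (12,)
print(dev.shape, utt.shape, stdc.shape)
```

In a trained system the encoder would be learned jointly with a decoder under a reconstruction loss, and the resulting code would feed a bona fide versus spoof classifier. The sketch only shows how frame-level and utterance-level representations can meet in one fused vector.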