End-to-End Spectro-Temporal Graph Attention Networks for Speaker Verification Anti-Spoofing and Speech Deepfake Detection

Authors: Hemlata Tak, Jee-weon Jung, Jose Patino, Madhu Kamble, Massimiliano Todisco, Nicholas Evans

Published: 2021-07-27 10:11:41+00:00

AI Summary

This paper proposes RawGAT-ST, a spectro-temporal graph attention network for speech deepfake detection. It learns relationships between spectral and temporal cues directly from raw waveforms, using a novel graph fusion and pooling strategy. The model achieves an equal error rate (EER) of 1.06% on the ASVspoof 2019 logical access database, which the authors describe as one of the best results reported to date.

Abstract

Artefacts that serve to distinguish bona fide speech from spoofed or deepfake speech are known to reside in specific sub-bands and temporal segments. Various approaches can be used to capture and model such artefacts; however, none works well across a spectrum of diverse spoofing attacks. Reliable detection then often depends upon the fusion of multiple detection systems, each tuned to detect different forms of attack. In this paper we show that better performance can be achieved when the fusion is performed within the model itself and when the representation is learned automatically from raw waveform inputs. The principal contribution is a spectro-temporal graph attention network (GAT) which learns the relationship between cues spanning different sub-bands and temporal intervals. Using a model-level graph fusion of spectral (S) and temporal (T) sub-graphs and a graph pooling strategy to improve discrimination, the proposed RawGAT-ST model achieves an equal error rate of 1.06% for the ASVspoof 2019 logical access database. This is one of the best results reported to date and is reproducible using an open source implementation.
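
The equal error rate quoted in the abstract is the operating point at which the false acceptance rate (spoofed speech accepted) equals the false rejection rate (bona fide speech rejected). The following is a minimal sketch of how such a figure can be estimated from detector scores. It assumes that a higher score means "more likely bona fide"; the function name `equal_error_rate` and the toy scores are illustrative and are not taken from the paper or from the ASVspoof evaluation tools.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Estimate the EER by sweeping every observed score as a threshold.

    Assumption: higher score = more likely bona fide.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical): one bona fide and one spoofed trial overlap.
bonafide = [0.9, 0.8, 0.7, 0.35]
spoof = [0.1, 0.2, 0.3, 0.6]
eer, threshold = equal_error_rate(bonafide, spoof)
print(eer, threshold)  # 0.25 0.6
```

With only a handful of trials the estimate is coarse; evaluation toolkits typically interpolate between thresholds, which this sketch does not attempt.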


Key findings
RawGAT-ST outperforms the baseline RawNet2 system and other competing methods, reaching an equal error rate of 1.06% on the ASVspoof 2019 logical access database. Ablation studies highlight the importance of both spectral and temporal attention, as well as the graph pooling strategy. The model's performance demonstrates the effectiveness of model-level fusion and the benefit of processing raw waveforms.
Approach
RawGAT-ST uses a sinc convolution layer to process raw waveforms, followed by a 2D residual network to learn higher-level features. These features are then fed into separate spectral and temporal graph attention networks, which are fused at the model level before final classification. A graph pooling strategy is employed to improve discrimination.
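
To make the graph attention step more concrete, here is a minimal single-head attention pass over a fully connected graph, written in plain Python. It follows the generic GAT formulation (project each node, score every node pair, softmax the scores, take the weighted sum) and is a sketch under that assumption; the attention function, pooling, and fusion actually used in RawGAT-ST may differ. All names (`gat_layer`, `W`, `a`) and the toy values are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(nodes, W, a):
    """One attention pass over a fully connected graph.

    nodes: N feature vectors of length F (e.g. one per sub-band or time segment)
    W:     F_out x F projection matrix (shared by all nodes)
    a:     attention vector of length 2 * F_out
    Returns (updated node features, N x N attention weights).
    """
    projected = [matvec(W, h) for h in nodes]
    outputs, attention = [], []
    for hi in projected:
        # Score node i against every node j using the concatenated pair.
        scores = [
            leaky_relu(sum(ak * v for ak, v in zip(a, hi + hj)))
            for hj in projected
        ]
        alpha = softmax(scores)
        attention.append(alpha)
        # New feature for node i: attention-weighted sum of projected nodes.
        outputs.append([
            sum(w * hj[k] for w, hj in zip(alpha, projected))
            for k in range(len(hi))
        ])
    return outputs, attention


# Toy graph with three nodes and two features per node (hypothetical values).
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]   # identity projection, for readability
a = [0.5, -0.5, 1.0, 0.0]
updated, weights = gat_layer(nodes, W, a)
print([round(sum(row), 6) for row in weights])  # each row of weights sums to 1
```

In a spectro-temporal setting, one such layer would operate on nodes indexed by sub-band and another on nodes indexed by temporal segment, with the two resulting graphs combined afterwards; how that combination and the subsequent pooling are performed is specific to the paper and is not reproduced here.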
Datasets
ASVspoof 2019 logical access (LA) database
Model(s)
Spectro-temporal Graph Attention Network (GAT), 2D residual network, sinc convolution layer
Author countries
France, South Korea