Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection

Authors: Ivan Viakhirev, Daniil Sirota, Aleksandr Smirnov, Kirill Borodin

Published: 2025-07-15 22:31:43+00:00

AI Summary

This paper refines the AASIST architecture for speech deepfake detection by freezing a Wav2Vec 2.0 encoder, replacing graph attention with multi-head attention, and using a trainable fusion layer. These modifications achieve a 7.6% equal error rate (EER) on the ASVspoof 5 corpus, improving on a re-implemented AASIST baseline trained under the same conditions.

Abstract

Advances in voice conversion and text-to-speech synthesis have made automatic speaker verification (ASV) systems more susceptible to spoofing attacks. This work explores modest refinements to the AASIST anti-spoofing architecture. It incorporates a frozen Wav2Vec 2.0 encoder to retain self-supervised speech representations in limited-data settings, substitutes the original graph attention block with a standardized multi-head attention module using heterogeneous query projections, and replaces heuristic frame-segment fusion with a trainable, context-aware integration layer. When evaluated on the ASVspoof 5 corpus, the proposed system reaches a 7.6% equal error rate (EER), improving on a re-implemented AASIST baseline under the same training conditions. Ablation experiments suggest that each architectural change contributes to the overall performance, indicating that targeted adjustments to established models may help strengthen speech deepfake detection in practical scenarios. The code is publicly available at https://github.com/KORALLLL/AASIST_SCALING.
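
To make the first refinement concrete, here is a minimal sketch of freezing a pretrained Wav2Vec 2.0 encoder so that only downstream modules receive gradient updates. The Hugging Face transformers API and the checkpoint name are illustrative assumptions; the paper does not specify this exact setup.

    import torch
    from transformers import Wav2Vec2Model

    # Checkpoint name is illustrative; the paper's exact Wav2Vec 2.0
    # variant may differ.
    encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

    # Freeze the encoder: its self-supervised representations stay
    # intact, which helps in limited-data settings.
    for p in encoder.parameters():
        p.requires_grad = False
    encoder.eval()

    waveform = torch.randn(1, 16000)  # one second of 16 kHz audio
    with torch.no_grad():
        frames = encoder(waveform).last_hidden_state  # (batch, T, dim)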


Key findings
Freezing the Wav2Vec 2.0 encoder, replacing graph attention with multi-head attention, and adding a trainable fusion layer together reduced the EER to 7.6% on ASVspoof 5, improving on the re-implemented AASIST baseline. Ablation studies showed that each modification contributed to this gain, suggesting that targeted adjustments to established models can strengthen speech deepfake detection.
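
For reference, the equal error rate is the operating point at which the false acceptance rate (spoofed audio accepted as bonafide) equals the false rejection rate (bonafide audio rejected). A minimal NumPy sketch, assuming per-utterance scores where higher means more likely bonafide:

    import numpy as np

    def compute_eer(bonafide_scores, spoof_scores):
        # Sort all scores; sweeping the threshold over them traces
        # out the FRR/FAR trade-off curve.
        scores = np.concatenate([bonafide_scores, spoof_scores])
        labels = np.concatenate([np.ones_like(bonafide_scores),
                                 np.zeros_like(spoof_scores)])
        labels = labels[np.argsort(scores)]
        frr = np.cumsum(labels) / labels.sum()                # bonafide rejected
        far = 1 - np.cumsum(1 - labels) / (1 - labels).sum()  # spoof accepted
        i = np.argmin(np.abs(far - frr))                      # where the rates meet
        return (far[i] + frr[i]) / 2

    compute_eer(np.array([0.9, 0.8, 0.7]), np.array([0.2, 0.6, 0.1]))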
Approach
The authors refine the AASIST anti-spoofing model in three ways: they freeze the Wav2Vec 2.0 encoder to preserve its pre-trained self-supervised representations, replace the original graph attention blocks with a standard multi-head attention module (with heterogeneous query projections) for simplicity and efficiency, and substitute a trainable, context-aware integration layer for the heuristic frame-segment fusion. A minimal sketch of the latter two changes follows.
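
The PyTorch sketch below illustrates the second and third changes: standard multi-head attention over encoder frames, followed by a trainable gate that fuses frame- and segment-level representations. The module structure, dimensions, and gating design are assumptions for illustration, not the authors' exact implementation.

    import torch
    import torch.nn as nn

    class AttentiveFusion(nn.Module):
        def __init__(self, dim=768, num_heads=8):
            super().__init__()
            # Standard multi-head self-attention in place of the
            # original graph attention block.
            self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Trainable fusion: a sigmoid gate mixes the frame-level
            # summary with segment-level (pooled) context per dimension.
            self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
            self.classifier = nn.Linear(dim, 2)  # bonafide vs. spoof

        def forward(self, frames):                # frames: (batch, T, dim)
            attended, _ = self.mha(frames, frames, frames)
            frame_repr = attended.mean(dim=1)     # attention-refined summary
            segment_repr = frames.mean(dim=1)     # plain segment context
            g = self.gate(torch.cat([frame_repr, segment_repr], dim=-1))
            fused = g * frame_repr + (1 - g) * segment_repr
            return self.classifier(fused)

    logits = AttentiveFusion()(torch.randn(4, 99, 768))  # e.g. Wav2Vec 2.0 frames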
Datasets
ASVspoof 5 corpus
Model(s)
Modified AASIST architecture with Wav2Vec 2.0 encoder, multi-head attention, and a trainable fusion layer.
Author countries
Russian Federation