AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge

Authors: Kirill Borodin, Vasiliy Kudryavtsev, Dmitrii Korzh, Alexey Efimenko, Grach Mkrtchian, Mikhail Gorodnichev, Oleg Y. Rogov

Published: 2024-08-30 15:30:01+00:00

AI Summary

The paper introduces AASIST3, a novel architecture for speech deepfake detection that enhances the AASIST framework with Kolmogorov-Arnold networks and additional layers. This results in a more than twofold performance improvement, achieving minDCF scores of 0.5357 (closed condition) and 0.1414 (open condition).

Abstract

Automatic Speaker Verification (ASV) systems, which identify speakers based on their voice characteristics, have numerous applications, such as user authentication in financial transactions, exclusive access control in smart devices, and forensic fraud detection. However, the advancement of deep learning algorithms has enabled the generation of synthetic audio through Text-to-Speech (TTS) and Voice Conversion (VC) systems, exposing ASV systems to potential vulnerabilities. To counteract this, we propose a novel architecture named AASIST3. By enhancing the existing AASIST framework with Kolmogorov-Arnold networks, additional layers, encoders, and pre-emphasis techniques, AASIST3 achieves a more than twofold improvement in performance. It demonstrates minDCF results of 0.5357 in the closed condition and 0.1414 in the open condition, significantly enhancing the detection of synthetic voices and improving ASV security.


Key findings
AASIST3 significantly outperforms the original AASIST model, achieving more than a twofold improvement in performance. minDCF scores of 0.5357 (closed condition) and 0.1414 (open condition) demonstrate its effectiveness in detecting synthetic speech. Experiments further showed that pre-emphasis preprocessing and combining multiple models improved results.
Approach
AASIST3 enhances the AASIST architecture by integrating Kolmogorov-Arnold networks (KANs) into its attention layers and by adding further layers and encoders. Audio is preprocessed with pre-emphasis, and for the open condition predictions from multiple models are combined.
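The pre-emphasis step mentioned above is a standard first-order high-pass filter, y[n] = x[n] − α·x[n−1], that boosts high-frequency content before feature extraction. As a hedged illustration (the paper's exact preprocessing code and filter coefficient are not given; α = 0.97 is a common default, not a value from the paper):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1].

    The first sample is passed through unchanged so the output
    has the same length as the input.
    """
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# A constant (DC) signal is strongly attenuated, since pre-emphasis
# suppresses low-frequency content while preserving fast changes.
x = np.ones(5)
y = pre_emphasis(x)
```

Here `y` keeps the first sample (1.0) and attenuates the remaining constant samples to 1 − α = 0.03, illustrating the filter's high-pass behavior.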
Datasets
ASVspoof 2024 Challenge datasets, Mozilla CommonVoice, VoxCeleb2
Model(s)
AASIST3 (based on AASIST), Wav2Vec2 XLS-R, KAN-GAL, KAN-GraphPool, KAN-HS-GAL
Author countries
Russia