Complex-valued neural networks for voice anti-spoofing


Authors: Nicolas M. Müller, Philip Sperl, Konstantin Böttinger

Published: 2023-08-22 21:49:38+00:00

AI Summary

This paper proposes using complex-valued neural networks to process the complex-valued constant-Q transform (CQT) of audio for voice anti-spoofing. The approach retains phase information, which improves detection performance and allows explainable AI methods to be applied. It outperforms previous methods on the In-the-Wild dataset.

Abstract

Current anti-spoofing and audio deepfake detection systems use either magnitude spectrogram-based features (such as CQT or Melspectrograms) or raw audio processed through convolution or sinc-layers. Both methods have drawbacks: magnitude spectrograms discard phase information, which affects audio naturalness, and raw-feature-based models cannot use traditional explainable AI methods. This paper proposes a new approach that combines the benefits of both methods by using complex-valued neural networks to process the complex-valued, CQT frequency-domain representation of the input audio. This method retains phase information and allows for explainable AI methods. Results show that this approach outperforms previous methods on the In-the-Wild anti-spoofing dataset and enables interpretation of the results through explainable AI. Ablation studies confirm that the model has learned to use phase information to detect voice spoofing.
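The claim that magnitude spectrograms discard phase information can be illustrated with a small sketch. It is not taken from the paper: it uses NumPy's FFT as a stand-in for the CQT, and the two test signals are made up for illustration. Both signals contain the same two frequency components, but with a different relative phase.

```python
import numpy as np

def magnitude_and_complex(signal):
    """Return (magnitude spectrum, complex spectrum) of a real 1-D signal."""
    spectrum = np.fft.rfft(signal)  # FFT used here as a stand-in for the CQT
    return np.abs(spectrum), spectrum

n = 256
t = np.arange(n)
# Same frequency content, different relative phase of the second component.
a = np.sin(2 * np.pi * 8 * t / n) + np.sin(2 * np.pi * 16 * t / n)
b = np.sin(2 * np.pi * 8 * t / n) + np.sin(2 * np.pi * 16 * t / n + np.pi / 2)

mag_a, spec_a = magnitude_and_complex(a)
mag_b, spec_b = magnitude_and_complex(b)

# The magnitude features cannot tell the two signals apart ...
same_magnitude = np.allclose(mag_a, mag_b, atol=1e-8)
# ... while the complex-valued representation can.
same_complex = np.allclose(spec_a, spec_b, atol=1e-8)
```

Under these assumptions the two waveforms differ, yet their magnitude spectra coincide; only the complex spectrum preserves the difference.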

Key findings

The proposed complex-valued neural network approach outperforms previous state-of-the-art methods on the In-the-Wild dataset, achieving a lower equal error rate (EER). Ablation studies confirm the importance of phase information for the model's performance. Explainable AI techniques reveal the model's decision-making process and indicate that it does not rely on learning shortcuts.
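For readers unfamiliar with the metric, the EER is the error rate at the decision threshold where the false acceptance rate equals the false rejection rate. The following is a minimal sketch of one way to approximate it, not the paper's evaluation code. The function name `equal_error_rate` and the convention that higher scores mean "more likely bona fide" are assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER, assuming higher scores mean 'more likely bona fide'."""
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is included.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # A sample is accepted as bona fide when score >= threshold.
    far = np.array([(spoof >= th).mean() for th in thresholds])    # spoofs accepted
    frr = np.array([(bonafide < th).mean() for th in thresholds])  # bona fide rejected

    # Take the threshold where the two error rates are closest and average them.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2.0)

eer_separable = equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
eer_overlap = equal_error_rate([0.9, 0.8, 0.4, 0.6], [0.1, 0.2, 0.5, 0.3])
```

With perfectly separated scores the sketch returns 0.0; in the second call, one bona fide sample and one spoof sample are on the wrong side of the best threshold, giving 0.25.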

Approach

The authors propose using a complex-valued convolutional neural network to process complex-valued constant-Q transform (CQT) spectrograms of audio. This retains the phase information that magnitude-only methods discard, improving performance and enabling explainable AI techniques. The model is evaluated using the equal error rate (EER).
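A complex-valued convolution can be expressed through real-valued convolutions, since (x + iy) * (a + ib) = (x*a - y*b) + i(x*b + y*a). The sketch below illustrates this identity in one dimension with NumPy. It is an assumption-laden illustration rather than the authors' implementation: the helper name `complex_conv1d` is hypothetical, the paper's network operates on 2-D spectrograms, and `np.convolve` flips the kernel, whereas deep learning frameworks usually compute cross-correlation.

```python
import numpy as np

def complex_conv1d(z, w):
    """Convolve complex signal z with complex kernel w using real convolutions only."""
    x, y = z.real, z.imag  # real and imaginary parts of the input
    a, b = w.real, w.imag  # real and imaginary parts of the kernel
    real = np.convolve(x, a, mode="valid") - np.convolve(y, b, mode="valid")
    imag = np.convolve(x, b, mode="valid") + np.convolve(y, a, mode="valid")
    return real + 1j * imag

rng = np.random.default_rng(0)
z = rng.standard_normal(32) + 1j * rng.standard_normal(32)  # toy complex input
w = rng.standard_normal(5) + 1j * rng.standard_normal(5)    # toy complex kernel

out = complex_conv1d(z, w)
reference = np.convolve(z, w, mode="valid")  # NumPy's native complex convolution
```

The four real convolutions reproduce NumPy's native complex convolution, which is why a complex layer can be built from ordinary real-valued building blocks.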

Datasets

ASVspoof 2019 (Logical Access section) and the In-the-Wild (ITW) dataset

Model(s)

Complex-valued convolutional neural network (CVNN) with complex ReLU activation and complex batch normalization.
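The summary does not define the two components, so the sketch below assumes formulations that are common in the complex-valued network literature: a "split" ReLU applied separately to the real and imaginary parts, and a batch normalization that whitens the 2x2 covariance of the real and imaginary parts. Whether the paper uses exactly these variants is an assumption. The function names are hypothetical, and the learned scale and shift parameters of a full layer are omitted.

```python
import numpy as np

def complex_relu(z):
    """Split ReLU: apply ReLU to the real and imaginary parts independently."""
    return np.maximum(z.real, 0.0) + 1j * np.maximum(z.imag, 0.0)

def complex_batch_norm(z, eps=1e-5):
    """Whiten a 1-D batch of complex values (no learned scale or shift)."""
    centered = z - z.mean()
    parts = np.stack([centered.real, centered.imag])          # shape (2, N)
    cov = parts @ parts.T / parts.shape[1] + eps * np.eye(2)   # 2x2 covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T  # cov^(-1/2)
    white = inv_sqrt @ parts
    return white[0] + 1j * white[1]

rng = np.random.default_rng(0)
real = 2.0 * rng.standard_normal(1000)
imag = 0.5 * real + rng.standard_normal(1000)  # deliberately correlated parts
normalized = complex_batch_norm(real + 1j * imag)
activated = complex_relu(np.array([1 - 2j, -3 + 4j, -1 - 1j]))
```

After whitening, the real and imaginary parts have zero mean, roughly unit variance, and are decorrelated. Normalizing the two parts independently, as a real-valued batch norm would, does not remove their correlation.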

Author countries

Germany
