Raw Differentiable Architecture Search for Speech Deepfake and Spoofing Detection

Authors: Wanying Ge, Jose Patino, Massimiliano Todisco, Nicholas Evans

Published: 2021-07-26 13:36:14+00:00

AI Summary

This paper introduces Raw PC-DARTS, an end-to-end speech deepfake and spoofing detection system that learns its network architecture automatically while operating directly on raw audio waveforms. The system achieves a minimum tandem detection cost function (min-tDCF) score of 0.0517 on the ASVspoof 2019 logical access database, which the authors report as among the best single-system results to date.

Abstract

End-to-end approaches to anti-spoofing, especially those which operate directly upon the raw signal, are starting to be competitive with their more traditional counterparts. Until recently, all such approaches considered only the learning of network parameters; the network architecture was still hand-crafted. This, too, can be learned. Described in this paper is our attempt to learn automatically the network architecture of a speech deepfake and spoofing detection solution, while jointly optimising other network components and parameters, such as the first convolutional layer which operates on raw signal inputs. The resulting raw differentiable architecture search system delivers a tandem detection cost function score of 0.0517 for the ASVspoof 2019 logical access database, a result which is among the best single-system results reported to date.


Key findings
Raw PC-DARTS achieves a min-tDCF of 0.0517 and an EER of 1.77% on the ASVspoof 2019 LA evaluation set, among the best single-system results reported to date. The system generalizes well, performing strongly even against the challenging A17 attack. Dilated convolutions dominate the learned architectures.
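For readers unfamiliar with the EER quoted above, the following is a minimal sketch of how an equal error rate can be estimated from detection scores. It is not the official ASVspoof scoring tool: the function name `compute_eer`, the convention that higher scores mean "more bona fide", and the toy scores are all assumptions made for illustration. The min-tDCF additionally accounts for the errors of a speaker verification system and is not sketched here.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the equal error rate by sweeping every observed score as a threshold.

    Assumed convention (hypothetical): a trial is accepted as bona fide
    when its score is >= the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint of the two rates at the closest crossing.
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy scores, invented for illustration only.
toy_bonafide = [0.9, 0.8, 0.7, 0.3]
toy_spoof = [0.1, 0.2, 0.4, 0.6]
toy_eer = compute_eer(toy_bonafide, toy_spoof)
```

With these toy scores, one bona fide trial in four is rejected and one spoofed trial in four is accepted at the crossing threshold, so the estimate is 0.25.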
Approach
Raw PC-DARTS uses differentiable architecture search to jointly optimize the network architecture and the network parameters, including a first convolutional layer that operates directly on raw audio. The system therefore learns the front-end features and the classifier together from the raw waveform, rather than relying on hand-crafted features or a hand-crafted architecture.
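The core idea of differentiable architecture search can be sketched as follows: each edge in a cell computes a softmax-weighted mixture of candidate operations, so the architecture weights can be trained by gradient descent, and the strongest operation is kept afterwards. This is a simplified illustration in plain Python, not the authors' implementation; the three toy operations, the names `mixed_op` and `derive_op`, and the weight values are assumptions, and the partial channel connections of PC-DARTS are omitted.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations on a 1D feature sequence.
def skip_connect(x):
    return list(x)


def avg_pool_3(x):
    # Window of 3, stride 1, edges padded by replication.
    padded = [x[0]] + list(x) + [x[-1]]
    return [sum(padded[i:i + 3]) / 3 for i in range(len(x))]


def max_pool_3(x):
    padded = [x[0]] + list(x) + [x[-1]]
    return [max(padded[i:i + 3]) for i in range(len(x))]


CANDIDATE_OPS = {
    "skip_connect": skip_connect,
    "avg_pool_3": avg_pool_3,
    "max_pool_3": max_pool_3,
}


def mixed_op(x, alphas):
    """Continuous relaxation: weighted sum of all candidate outputs."""
    weights = softmax([alphas[name] for name in CANDIDATE_OPS])
    out = [0.0] * len(x)
    for weight, op in zip(weights, CANDIDATE_OPS.values()):
        for i, value in enumerate(op(x)):
            out[i] += weight * value
    return out


def derive_op(alphas):
    """After the search, keep only the operation with the largest weight."""
    return max(alphas, key=alphas.get)


# Equal architecture weights give each operation a share of 1/3.
uniform_alphas = {"skip_connect": 0.0, "avg_pool_3": 0.0, "max_pool_3": 0.0}
mixed = mixed_op([1.0, 2.0, 4.0], uniform_alphas)
```

In a real search the architecture weights and the operation parameters are updated alternately on separate data splits; here the weights are fixed by hand only to show the forward computation.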
Datasets
ASVspoof 2019 Logical Access (LA) database
Model(s)
Partially-connected Differentiable Architecture Search (PC-DARTS) with sinc filters as the initial layer; Gated Recurrent Unit (GRU) for utterance-level representation; 1D convolutional operations (standard and dilated convolutions, max/average pooling, skip connections) within the cells.
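To illustrate what a sinc filter in the initial layer looks like, the sketch below builds one band-pass kernel as the difference of two windowed low-pass sinc responses, each kernel being defined only by its two cut-off frequencies. It assumes a SincNet-style parametrisation with a Hamming window; the function names, the 16 kHz sample rate, the 129-tap kernel length, and the 1000 to 2000 Hz band are illustrative assumptions rather than the paper's settings.

```python
import math


def sinc(x):
    """Unnormalised sinc, sin(x) / x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def sinc_bandpass(f_low_hz, f_high_hz, sample_rate=16000, kernel_size=129):
    """Band-pass FIR taps defined only by two cut-off frequencies."""
    f1 = f_low_hz / sample_rate   # normalised low cut-off (cycles/sample)
    f2 = f_high_hz / sample_rate  # normalised high cut-off (cycles/sample)
    half = (kernel_size - 1) // 2
    taps = []
    for k in range(kernel_size):
        n = k - half  # sample index centred on zero
        # Difference of two ideal low-pass impulse responses.
        ideal = (2 * f2 * sinc(2 * math.pi * f2 * n)
                 - 2 * f1 * sinc(2 * math.pi * f1 * n))
        # Hamming window to smooth the truncation of the ideal response.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (kernel_size - 1))
        taps.append(ideal * window)
    return taps


def gain_at(taps, freq_hz, sample_rate=16000):
    """Magnitude of the filter's frequency response at one frequency."""
    w = 2 * math.pi * freq_hz / sample_rate
    re = sum(t * math.cos(w * k) for k, t in enumerate(taps))
    im = sum(t * math.sin(w * k) for k, t in enumerate(taps))
    return math.hypot(re, im)


# Hypothetical band, chosen only for illustration.
example_taps = sinc_bandpass(1000.0, 2000.0)
```

Calling `gain_at(example_taps, 1500.0)` should give a value close to 1, while frequencies well outside the band, such as 4000 Hz, are strongly attenuated. Convolving the raw waveform with a bank of such kernels yields the input to the searched cells.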
Author countries
France