Raw Differentiable Architecture Search for Speech Deepfake and Spoofing Detection

Authors: Wanying Ge, Jose Patino, Massimiliano Todisco, Nicholas Evans

Published: 2021-07-26 13:36:14+00:00

Comment: Accepted to ASVspoof 2021 Workshop

AI Summary

This paper introduces Raw PC-DARTS, an end-to-end differentiable architecture search method for speech deepfake and spoofing detection. The approach automatically learns the deep network architecture while jointly optimizing all network components and parameters, including a first convolutional layer that operates directly on raw audio signals. It demonstrates that a fully learned system can perform competitively with state-of-the-art hand-crafted solutions.

Abstract

End-to-end approaches to anti-spoofing, especially those which operate directly upon the raw signal, are starting to be competitive with their more traditional counterparts. Until recently, all such approaches considered only the learning of network parameters; the network architecture was still hand crafted. This, too, can be learned. Described in this paper is our attempt to learn automatically the network architecture of a speech deepfake and spoofing detection solution, while jointly optimising other network components and parameters, such as the first convolutional layer which operates on raw signal inputs. The resulting raw differentiable architecture search system delivers a tandem detection cost function score of 0.0517 for the ASVspoof 2019 logical access database, a result which is among the best single-system results reported to date.


Key findings
The Raw PC-DARTS system achieved a min-tDCF of 0.0517 and an EER of 1.77% on the ASVspoof 2019 LA evaluation partition, placing it among the best single-system results reported to date. The approach also generalized better to unseen spoofing attacks, with a notably lower worst-case EER than many competing systems. Dilated convolution operations dominated the learned architectures, which suggests they are important for raw waveform processing.
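For readers unfamiliar with the EER figure quoted above, the following is a minimal sketch of how an equal error rate can be estimated from detection scores. It is not the official ASVspoof scoring tool: the function name `compute_eer`, the brute-force threshold sweep, and the convention that higher scores mean "more likely bona fide" are all assumptions made for illustration.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the equal error rate (EER) and the threshold where it occurs.

    Assumption: a higher score means the trial looks more like bona fide speech.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: bona fide trials scoring below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scoring at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical, not taken from the paper).
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
```

The sweep is quadratic in the number of trials, which is fine for a toy example but too slow for a full evaluation set. The min-tDCF metric additionally depends on the errors of a speaker verification system and on application costs, so it is not covered by this sketch.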
Approach
The authors propose Raw PC-DARTS, an end-to-end system that applies differentiable architecture search directly to raw time-domain waveforms for speech deepfake detection. It uses learnable sinc filters as the first layer and a modified cell architecture (normal and expand cells with max-pooling) within the PC-DARTS framework to overcome dimensionality issues in raw audio processing. The system jointly optimizes the network architecture and its parameters.
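To make the sinc-filter front end more concrete, here is a minimal sketch of a single band-pass sinc filter, assuming a SincNet-style parameterization in which each filter is fully described by a low and a high cutoff frequency. The names `sinc_bandpass` and `gain_at`, the kernel size of 129 taps, the Hamming window, and the 16 kHz sample rate are illustrative assumptions, not values confirmed by this summary. In a learnable version, the two cutoffs would be trainable parameters rather than fixed arguments.

```python
import cmath
import math


def sinc_bandpass(f_low_hz, f_high_hz, kernel_size=129, sample_rate=16000):
    """Return the taps of a Hamming-windowed band-pass sinc filter."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the filter is symmetric")
    half = kernel_size // 2
    f1 = f_low_hz / sample_rate   # normalized low cutoff (cycles per sample)
    f2 = f_high_hz / sample_rate  # normalized high cutoff
    taps = []
    for i in range(kernel_size):
        n = i - half
        if n == 0:
            ideal = 2 * (f2 - f1)  # limit of the expression below as n -> 0
        else:
            # Difference of two ideal low-pass (sinc) impulse responses.
            ideal = (math.sin(2 * math.pi * f2 * n)
                     - math.sin(2 * math.pi * f1 * n)) / (math.pi * n)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        taps.append(ideal * window)
    return taps


def gain_at(taps, freq_hz, sample_rate=16000):
    """Magnitude of the filter's frequency response at freq_hz."""
    omega = 2 * math.pi * freq_hz / sample_rate
    return abs(sum(h * cmath.exp(-1j * omega * n) for n, h in enumerate(taps)))


# Hypothetical 300-3000 Hz band: high gain inside the band, low gain outside.
taps = sinc_bandpass(300.0, 3000.0)
in_band, out_of_band = gain_at(taps, 1500.0), gain_at(taps, 6000.0)
```

Because only two numbers define each filter, a bank of such filters has far fewer free parameters than an unconstrained convolutional layer of the same kernel size.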
Datasets
ASVspoof 2019 Logical Access (LA) database
Model(s)
Raw PC-DARTS (based on PC-DARTS and DARTS), Sinc filters, 1D convolutional operations (standard and dilated convolutions, max/average pooling), Gated Recurrent Unit (GRU), Fully Connected (FC) layer.
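The core idea behind DARTS-style search over such candidate operations can be sketched as a softmax-weighted mixture: every candidate is applied to the input, and learnable architecture weights decide how much each one contributes. The sketch below is a simplification under stated assumptions. The operation names, the fixed smoothing kernel `[0.25, 0.5, 0.25]`, and the `skip_connect` candidate are hypothetical, and the partial-channel sampling and edge normalization that distinguish PC-DARTS are omitted.

```python
import math


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dilated_conv1d(x, kernel, dilation):
    """Same-length 1-D convolution with zero padding and a dilation factor."""
    half = (len(kernel) // 2) * dilation
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - half + k * dilation
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def pool1d(x, size, reduce_fn):
    """Same-length pooling; windows are truncated at the sequence edges."""
    half = size // 2
    return [reduce_fn(x[max(0, t - half): t + half + 1]) for t in range(len(x))]


# Hypothetical candidate set; real searches use learned convolution kernels.
CANDIDATE_OPS = {
    "skip_connect": lambda x: list(x),
    "max_pool_3": lambda x: pool1d(x, 3, max),
    "avg_pool_3": lambda x: pool1d(x, 3, lambda w: sum(w) / len(w)),
    "std_conv_3": lambda x: dilated_conv1d(x, [0.25, 0.5, 0.25], dilation=1),
    "dil_conv_3": lambda x: dilated_conv1d(x, [0.25, 0.5, 0.25], dilation=2),
}


def mixed_op(x, alphas):
    """Softmax-weighted sum of every candidate operation applied to x."""
    weights = softmax([alphas[name] for name in CANDIDATE_OPS])
    outputs = [op(x) for op in CANDIDATE_OPS.values()]
    return [sum(w * out[t] for w, out in zip(weights, outputs))
            for t in range(len(x))]


def derive_op(alphas):
    """After the search, keep only the operation with the largest weight."""
    return max(alphas, key=alphas.get)
```

During the search, the `alphas` would be updated by gradient descent alongside the ordinary network weights; afterwards, `derive_op` discretizes the mixture into a single operation per edge.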
Author countries
France