Multi-Task Learning in Utterance-Level and Segmental-Level Spoof Detection

Authors: Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi

Published: 2021-07-29 16:04:25+00:00

AI Summary

This paper proposes SELCNN, a light convolutional neural network with squeeze-and-excitation blocks for enhanced feature selection, and applies it within multi-task learning (MTL) frameworks for simultaneous utterance-level and segmental-level spoof detection in the PartialSpoof database. Experiments demonstrate that the multi-task binary-branch architecture, particularly when fine-tuned from a segmental warm-up model, outperforms single-task models.

Abstract

In this paper, we provide a series of multi-task benchmarks for simultaneously detecting spoofing at the segmental and utterance levels in the PartialSpoof database. First, we propose the SELCNN network, which inserts squeeze-and-excitation (SE) blocks into a light convolutional neural network (LCNN) to enhance the capacity for hidden feature selection. Then, we implement multi-task learning (MTL) frameworks with SELCNN followed by bidirectional long short-term memory (Bi-LSTM) as the basic model. We discuss MTL on PartialSpoof step by step in terms of architecture (uni-branch/binary-branch) and training strategy (from-scratch/warm-up). Experiments show that the multi-task model performs better overall than single-task models. Within MTL, the binary-branch architecture makes better use of information from the two levels than the uni-branch architecture, and for the binary-branch architecture, fine-tuning from a warm-up model works better than training from scratch. Overall, models under the binary-branch multi-task architecture can handle segmental and utterance-level predictions simultaneously. Furthermore, the multi-task model fine-tuned from a segmental warm-up model performs better at both levels, except for segmental detection on the evaluation set. Segmental detection should be explored further.
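
The squeeze-and-excitation mechanism at the heart of SELCNN can be summarized in a few lines. Below is a minimal PyTorch sketch of an SE block as it might be inserted after a convolutional layer of an LCNN; the module name, the reduction ratio, and the surrounding convolution are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels via globally pooled statistics.

    A minimal sketch; the reduction ratio (r=16) is a common default,
    not necessarily the value used in SELCNN.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # squeeze
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # excite
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # global average pool -> (b, c)
        w = self.fc(w).view(b, c, 1, 1)  # per-channel gates in [0, 1]
        return x * w                     # rescale feature maps channel-wise

# Usage sketch: interleave SE blocks with LCNN convolutions.
layer = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), SEBlock(32))
out = layer(torch.randn(4, 1, 100, 60))  # e.g. (batch, 1, frames, feature bins)
```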


Key findings
Multi-task models outperform single-task models at simultaneous utterance-level and segmental-level spoof detection. The binary-branch architecture is more effective than the uni-branch architecture. Fine-tuning from a segmental warm-up model yields the best results overall (except for segmental detection on the evaluation set), indicating the benefit of incorporating fine-grained segmental information.
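
The warm-up strategy highlighted here follows a standard transfer pattern: train the segmental single-task model first, then reuse its weights to initialize the multi-task model before joint fine-tuning. A minimal sketch of that hand-off follows; the module names, layer sizes, and checkpoint path are hypothetical, as the summary does not specify them.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the shared front-end and the two branch heads;
# names and sizes are illustrative, not from the paper.
encoder = nn.LSTM(input_size=60, hidden_size=128,
                  bidirectional=True, batch_first=True)
seg_head = nn.Linear(256, 2)  # per-frame (segmental) scores
utt_head = nn.Linear(256, 2)  # utterance-level scores

# Step 1 (warm-up): train encoder + seg_head on segmental labels only
# (training loop omitted), then save the warmed-up weights.
torch.save({"encoder": encoder.state_dict(),
            "seg_head": seg_head.state_dict()}, "seg_warmup.pt")

# Step 2 (fine-tune): rebuild the multi-task model, restore the warmed-up
# parts, and continue training with both losses; utt_head starts fresh.
ckpt = torch.load("seg_warmup.pt")
encoder.load_state_dict(ckpt["encoder"])
seg_head.load_state_dict(ckpt["seg_head"])
```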
Approach
The authors propose a multi-task learning approach using a modified LCNN (SELCNN, with squeeze-and-excitation blocks) followed by a Bi-LSTM. They explore uni-branch and binary-branch architectures, as well as from-scratch and warm-up training strategies, to detect spoofing at both the utterance and segment levels simultaneously.
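
To make the binary-branch idea concrete, here is a schematic PyTorch model: a shared front-end (a simple stand-in for SELCNN) feeds a Bi-LSTM, after which a segmental branch scores each frame and an utterance branch scores a pooled summary, and the two cross-entropy losses are summed. All layer sizes, the mean-pooling choice, and the unweighted loss sum are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BinaryBranchMTL(nn.Module):
    """Sketch of a binary-branch multi-task model (sizes are illustrative)."""
    def __init__(self, feat_dim: int = 60, hidden: int = 128):
        super().__init__()
        # Stand-in for the SELCNN front-end (per-frame feature extractor).
        self.frontend = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.blstm = nn.LSTM(hidden, hidden, bidirectional=True,
                             batch_first=True)
        self.seg_head = nn.Linear(2 * hidden, 2)  # segmental branch
        self.utt_head = nn.Linear(2 * hidden, 2)  # utterance branch

    def forward(self, x):
        # x: (batch, frames, feat_dim)
        h, _ = self.blstm(self.frontend(x))        # (batch, frames, 2*hidden)
        seg_logits = self.seg_head(h)              # per-frame scores
        utt_logits = self.utt_head(h.mean(dim=1))  # pooled utterance score
        return seg_logits, utt_logits

# Joint training step with both losses.
model = BinaryBranchMTL()
x = torch.randn(4, 100, 60)            # batch of 100-frame inputs
seg_y = torch.randint(0, 2, (4, 100))  # per-frame bona fide/spoof labels
utt_y = torch.randint(0, 2, (4,))      # per-utterance labels
seg_logits, utt_logits = model(x)
ce = nn.CrossEntropyLoss()
loss = ce(seg_logits.reshape(-1, 2), seg_y.reshape(-1)) + ce(utt_logits, utt_y)
loss.backward()
```

In a uni-branch variant, by contrast, a single output branch would serve both levels (e.g. deriving the utterance score from the segmental scores), which is the design the paper finds less effective.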
Datasets
PartialSpoof database
Model(s)
SELCNN (Squeeze-and-Excitation blocks inserted into a Light Convolutional Neural Network) + Bi-LSTM
Author countries
Japan