Integrated Replay Spoofing-aware Text-independent Speaker Verification

Authors: Hye-jin Shim, Jee-weon Jung, Ju-ho Kim, Seung-bin Kim, Ha-Jin Yu

Published: 2020-06-10 01:24:55+00:00

AI Summary

This paper proposes two approaches for integrated speaker verification and presentation attack detection: a monolithic end-to-end approach and a modular back-end approach. Experiments show that the modular approach, using separate DNNs for speaker verification and presentation attack detection, yields a 21.77% relative improvement in equal error rate compared to a conventional system.

Abstract

A number of studies have successfully developed speaker verification or presentation attack detection systems. However, studies integrating the two tasks remain in the preliminary stages. In this paper, we propose two approaches for building an integrated system of speaker verification and presentation attack detection: an end-to-end monolithic approach and a back-end modular approach. The first approach simultaneously trains speaker identification, presentation attack detection, and the integrated system through multi-task learning on a common feature. However, through experiments, we hypothesize that the information required for performing speaker verification and presentation attack detection might differ because speaker verification systems try to remove device-specific information from speaker embeddings, while presentation attack detection systems exploit such information. Therefore, we propose a back-end modular approach using a separate deep neural network (DNN) for speaker verification and presentation attack detection. This approach has three input components: two speaker embeddings (one each for enrollment and test) and a prediction of presentation attacks. Experiments are conducted using the ASVspoof 2017-v2 dataset, which includes official trials on the integration of speaker verification and presentation attack detection. The proposed back-end approach demonstrates a relative improvement of 21.77% in terms of the equal error rate for integrated trials compared to a conventional speaker verification system.

Key findings

The monolithic end-to-end approach performed poorly, suggesting that the information needed for speaker verification and presentation attack detection differs. The modular back-end approach significantly improved the equal error rate for integrated speaker verification, achieving a 21.77% relative improvement over a conventional system. The modular approach effectively separated the score distributions for genuine and spoofed audio.
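The headline figure is a relative change in equal error rate (EER), so it may help to see how such a number is typically computed. The following is a minimal sketch, not the authors' evaluation code: the function names and the example scores are hypothetical, and it assumes that in integrated trials both zero-effort impostors and replayed (spoofed) utterances count as non-target trials.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER (in percent) by sweeping every observed score as a threshold."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))

    # False rejection rate: target trials scoring below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials scoring at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest to each other.
    idx = np.argmin(np.abs(far - frr))
    return 100.0 * (far[idx] + frr[idx]) / 2.0


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction (in percent) of a new system over a baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical scores, for illustration only (not data from the paper).
genuine = [0.9, 0.8, 0.7]
impostor_and_replay = [0.1, 0.2, 0.3]
eer = equal_error_rate(genuine, impostor_and_replay)
```

Under this definition, a 21.77% relative improvement means the EER of the proposed system is roughly 78% of the baseline EER; the absolute values are not given in this summary.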

Approach

The paper proposes two approaches: a monolithic end-to-end approach using multi-task learning on a common feature and a modular back-end approach using separate DNNs for speaker verification and presentation attack detection, with the final decision made by combining the outputs of both models. The modular approach proved superior.
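The data flow of the modular back-end can be sketched as a small network that takes the three inputs named in the abstract: an enrollment embedding, a test embedding, and a presentation attack prediction. This is an illustrative sketch only. The class name, layer sizes, activations, and randomly initialized weights are assumptions and do not reflect the architecture or training procedure reported in the paper.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class BackEndSketch:
    """Hypothetical back-end: two speaker embeddings plus a PAD score -> acceptance score."""

    def __init__(self, emb_dim, hidden_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = 2 * emb_dim + 1  # enrollment + test embeddings + one PAD prediction
        # Untrained random weights; a real system would learn these from labeled trials.
        self.w1 = rng.normal(0.0, 0.1, size=(in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.1, size=(hidden_dim, 1))
        self.b2 = np.zeros(1)

    def score(self, enroll_emb, test_emb, pad_score):
        """Return a value in (0, 1); higher means 'same speaker and bona fide'."""
        x = np.concatenate([enroll_emb, test_emb, [pad_score]])
        h = relu(x @ self.w1 + self.b1)
        return float(sigmoid(h @ self.w2 + self.b2)[0])


# Usage with placeholder inputs (random vectors stand in for real embeddings).
rng = np.random.default_rng(1)
backend = BackEndSketch(emb_dim=8)
s = backend.score(rng.normal(size=8), rng.normal(size=8), pad_score=0.3)
```

Keeping the speaker embedding extractor and the presentation attack detector as separate front-ends lets each retain the information it needs, in line with the paper's hypothesis that the two tasks treat device-specific information differently.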

Datasets

ASVspoof 2017-v2 dataset

Model(s)

Light CNN (LCNN)

Author countries

Republic of Korea
