Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models

Authors: Borodin Kirill Nikolayevich, Kudryavtsev Vasiliy Dmitrievich, Mkrtchian Grach Maratovich, Gorodnichev Mikhail Genadievich, Korzh Dmitrii Sergeevich

Published: 2024-06-27 15:08:51+00:00

AI Summary

This paper presents an automatic speaker verification (ASV) system that extracts embeddings from a speaker's audio to capture voice characteristics such as pitch, energy, and phoneme duration. The system was also used in the SSTC challenge to verify speakers whose audio had undergone voice conversion, achieving an equal error rate (EER) of 20.669%.

Abstract

One of the most crucial components in the field of biometric security is the automatic speaker verification (ASV) system, which is based on the speaker's voice. It is possible to utilise ASVs in isolation or in conjunction with other AI models. In the contemporary era, the quality and quantity of neural networks are increasing exponentially. Concurrently, there is a growing number of systems that aim to manipulate data through the use of voice conversion and text-to-speech models. The field of voice biometrics forgery is aided by a number of challenges, including SSTC, ASVSpoof, and SingFake. This paper presents a system for automatic speaker verification. The primary objective of our model is the extraction of embeddings from the target speaker's audio in order to obtain information about important characteristics of the speaker's voice, such as pitch, energy, and the duration of phonemes. This information is used in our multivoice TTS pipeline, which is currently under development. In addition, this model was employed within the SSTC challenge to verify users whose voice had undergone voice conversion, where it demonstrated an EER of 20.669%.


Key findings
The ASV system achieved a TPR@FPR=0.01 of 0.5920 on the test set. In the SSTC challenge, it reached an EER of 20.669% after ensembling with the baseline model. The learned embeddings also improved the performance of a phoneme duration predictor in the authors' TTS pipeline.
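For reference, a minimal sketch of how EER and TPR@FPR=0.01 can be computed from verification trial scores. The function name and the pure-NumPy threshold sweep are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def eer_and_tpr_at_fpr(scores, labels, target_fpr=0.01):
    """Compute EER and TPR at a fixed FPR from verification scores.

    scores: similarity scores (higher = more likely same speaker)
    labels: 1 for genuine (same-speaker) trials, 0 for impostor trials
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    genuine = scores[labels == 1]
    impostor = scores[labels == 0]

    # Sweep decision thresholds over the observed scores (descending).
    thresholds = np.unique(scores)[::-1]
    tpr = np.array([(genuine >= t).mean() for t in thresholds])
    fpr = np.array([(impostor >= t).mean() for t in thresholds])
    fnr = 1.0 - tpr

    # EER: operating point where false accepts and false rejects balance.
    idx = np.nanargmin(np.abs(fpr - fnr))
    eer = (fpr[idx] + fnr[idx]) / 2.0

    # TPR at the strictest threshold whose FPR stays within the target.
    ok = fpr <= target_fpr
    tpr_at_fpr = tpr[ok].max() if ok.any() else 0.0
    return eer, tpr_at_fpr

# Example usage (hypothetical inputs):
# eer, tpr = eer_and_tpr_at_fpr(cosine_scores, trial_labels)
```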
Approach
The approach uses three encoders (CQT, mel-spectrogram, and pitch) to extract complementary features from the audio. These features are concatenated and passed through a fully-connected layer, and the model is trained with an AM-Softmax loss for speaker verification. The resulting speaker embeddings are also used to improve a phoneme duration predictor in the TTS pipeline.
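As an illustration of this fusion scheme, here is a minimal PyTorch sketch of a three-branch embedder. The branch encoders, feature dimensions, and pooling are placeholder assumptions (the paper uses Specblock and ViT encoders); only the concatenate-then-project structure mirrors the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASVEmbedder(nn.Module):
    """Three-branch speaker embedder: CQT, mel, and pitch features
    are encoded separately, pooled, concatenated, and projected."""

    def __init__(self, cqt_dim=84, mel_dim=80, pitch_dim=1,
                 branch_dim=256, embed_dim=512):
        super().__init__()
        # Placeholder branch encoders: small GRUs instead of the
        # paper's Specblock/ViT modules.
        self.cqt_enc = nn.GRU(cqt_dim, branch_dim, batch_first=True)
        self.mel_enc = nn.GRU(mel_dim, branch_dim, batch_first=True)
        self.pitch_enc = nn.GRU(pitch_dim, branch_dim, batch_first=True)
        # Fully-connected projection into the speaker embedding space.
        self.fc = nn.Linear(3 * branch_dim, embed_dim)

    @staticmethod
    def _pool(seq_out):
        # Average over time frames to get a fixed-size vector per branch.
        return seq_out.mean(dim=1)

    def forward(self, cqt, mel, pitch):
        c, _ = self.cqt_enc(cqt)      # (B, T, branch_dim)
        m, _ = self.mel_enc(mel)
        p, _ = self.pitch_enc(pitch)
        fused = torch.cat([self._pool(c), self._pool(m), self._pool(p)], dim=-1)
        return F.normalize(self.fc(fused), dim=-1)
```

At inference, two utterances would be compared via the cosine similarity of their embeddings; training would attach an AM-Softmax classification head over speaker identities (see the loss sketch under Model(s)).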
Datasets
LibriTTS-R (for training), a Kaggle dataset with speeches from five speakers (for testing), CMU ARCTIC (for testing), and the SSTC dataset (for the SSTC challenge).
Model(s)
The model consists of three encoders (a CQT encoder built on Specblock, a mel-spectrogram encoder based on ViT, and a pitch encoder based on ViT), followed by a fully-connected layer and an AM-Softmax loss. For the SSTC challenge, the system was also ensembled with the challenge's baseline model.
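For completeness, a minimal sketch of the AM-Softmax objective named above, applied to L2-normalized embeddings. The scale and margin values are common defaults, not hyperparameters reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Additive-margin softmax over L2-normalized speaker embeddings."""

    def __init__(self, embed_dim, num_speakers, scale=30.0, margin=0.35):
        super().__init__()
        # One learnable class center per training speaker.
        self.weight = nn.Parameter(torch.empty(num_speakers, embed_dim))
        nn.init.xavier_normal_(self.weight)
        self.scale = scale
        self.margin = margin

    def forward(self, embeddings, speaker_ids):
        # Cosine similarity between embeddings and class centers.
        cos = F.linear(F.normalize(embeddings, dim=-1),
                       F.normalize(self.weight, dim=-1))
        # Subtract the margin only from the target-class cosine,
        # then scale and apply standard cross-entropy.
        one_hot = F.one_hot(speaker_ids, cos.size(1)).to(cos.dtype)
        logits = self.scale * (cos - self.margin * one_hot)
        return F.cross_entropy(logits, speaker_ids)
```

Penalizing the target-class cosine by a fixed margin forces same-speaker embeddings to sit closer to their class center than plain softmax training would, which is why this family of losses is widely used for verification-style embedding models.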
Author countries
Russia