An End-to-End Multi-Module Audio Deepfake Generation System for ADD Challenge 2023

Authors: Sheng Zhao, Qilong Yuan, Yibo Duan, Zhuoyue Chen

Published: 2023-07-03 03:21:23+00:00

AI Summary

This paper presents an end-to-end multi-module audio deepfake generation system consisting of a speaker encoder, a Tacotron2-based synthesizer, and a WaveRNN-based vocoder. This system achieved first place in the ADD 2023 challenge Track 1.1, demonstrating high-quality synthetic speech generation.

Abstract

The task of synthetic speech generation is to generate language content from a given text and then simulate a fake human voice. The key factors that determine the quality of synthetic speech generation include generation speed, accuracy of word segmentation, and naturalness of the synthesized speech. This paper builds an end-to-end multi-module synthetic speech generation model comprising a speaker encoder, a synthesizer based on Tacotron2, and a vocoder based on WaveRNN. In addition, we perform extensive comparative experiments on different datasets and various model structures. Finally, we won first place in the ADD 2023 challenge Track 1.1 with a weighted deception success rate (WDSR) of 44.97%.


Key findings
The proposed model outperforms the FastSpeech baseline in terms of Equal Error Rate (EER) on both the AISHELL-3 and LibriTTS datasets. Using WaveRNN as the vocoder yields better results than HiFi-GAN. The system achieved first place in the ADD 2023 challenge Track 1.1 with a weighted deception success rate (WDSR) of 44.97%.
Approach
The authors address audio deepfake generation with a three-module system: a speaker encoder supplies speaker information to a Tacotron2-based synthesizer, which generates mel-spectrogram features, and a WaveRNN-based vocoder converts those features into a waveform. A minimal sketch of this pipeline appears below.
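The paper does not include an implementation, but the described approach maps naturally onto a three-stage pipeline. Below is a minimal PyTorch sketch of that structure; the class names, layer sizes, and hop length are illustrative assumptions, and the Synthesizer and Vocoder here are toy stand-ins for Tacotron2 and WaveRNN, not the authors' models.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Hypothetical BiLSTM speaker encoder: reference mel frames -> fixed-size embedding."""
    def __init__(self, n_mels=80, hidden=256, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)

    def forward(self, ref_mels):                      # (B, T_ref, n_mels)
        out, _ = self.lstm(ref_mels)
        emb = self.proj(out.mean(dim=1))              # average over time
        return nn.functional.normalize(emb, dim=-1)   # unit-norm speaker embedding


class Synthesizer(nn.Module):
    """Toy stand-in for the Tacotron2 synthesizer: text ids + speaker embedding -> mel frames."""
    def __init__(self, vocab=100, emb_dim=256, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, emb_dim)
        self.decoder = nn.GRU(2 * emb_dim, 512, batch_first=True)
        self.to_mel = nn.Linear(512, n_mels)

    def forward(self, text_ids, spk_emb):             # (B, T_txt), (B, emb_dim)
        t = self.text_emb(text_ids)
        s = spk_emb.unsqueeze(1).expand(-1, t.size(1), -1)  # condition every step
        out, _ = self.decoder(torch.cat([t, s], dim=-1))
        return self.to_mel(out)                       # (B, T_txt, n_mels)


class Vocoder(nn.Module):
    """Toy stand-in for the WaveRNN vocoder: mel frames -> waveform samples."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True)
        self.to_wav = nn.Linear(256, hop)             # hop samples per mel frame

    def forward(self, mels):                          # (B, T, n_mels)
        out, _ = self.rnn(mels)
        return self.to_wav(out).flatten(1)            # (B, T * hop)


# End-to-end pass with dummy data: derive a speaker embedding from a
# reference utterance, then synthesize a waveform for a token sequence.
enc, syn, voc = SpeakerEncoder(), Synthesizer(), Vocoder()
ref_mels = torch.randn(1, 120, 80)                    # reference utterance features
text_ids = torch.randint(0, 100, (1, 50))             # phoneme/character ids
wav = voc(syn(text_ids, enc(ref_mels)))
print(wav.shape)                                      # torch.Size([1, 12800])
```

The key design point the sketch preserves is that the speaker embedding is injected at every synthesizer decoding step, so the generated features are conditioned on the target voice throughout, before the vocoder renders them into audio.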
Datasets
AISHELL-3, LibriTTS, MUSAN, RIRs
Model(s)
BiLSTM, Tacotron2, WaveRNN, HiFi-GAN (compared in ablation study), RawNet2, Res-TSSDNet, ECAPA-TDNN (detection models used for evaluation)
Author countries
China