Deepfake audio as a data augmentation technique for training automatic speech to text transcription models

Authors: Alexandre R. Ferreira, Cláudio E. C. Campelo

Published: 2023-09-22 11:33:03+00:00

AI Summary

This paper proposes a framework using deepfake audio for data augmentation in training automatic speech-to-text transcription models, addressing the scarcity of diverse labeled datasets for less popular languages. Experiments were conducted using a voice cloner and an Indian English dataset to evaluate the framework's impact on transcription accuracy.

Abstract

To train transcription models that produce robust results, a large and diverse labeled dataset is required. Finding data with the necessary characteristics is a challenging task, especially for languages less popular than English. Moreover, producing such data requires significant effort and, often, money. One strategy to mitigate this problem is the use of data augmentation techniques. In this work, we propose a framework for data augmentation based on deepfake audio. To validate the framework, experiments were conducted using existing deepfake and transcription models. A voice cloner and an English-language dataset recorded by Indian speakers were selected, ensuring the presence of a single accent in the dataset. The augmented data was then used to train speech-to-text models in various scenarios.


Key findings
Experiments showed a decrease in transcription accuracy, i.e., an increased Word Error Rate (WER), after training with the deepfake-augmented data. The authors attribute this to the relatively low quality of the audio generated by the voice cloner, and suggest that improvements to the voice cloning model are needed before this augmentation technique becomes effective.
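WER is the standard edit-distance metric used throughout the paper's evaluation: (substitutions + deletions + insertions) divided by the number of reference words. For reference, a minimal Levenshtein-based WER computation in Python might look like the following; this is an illustrative sketch, not the authors' evaluation code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions to reach an empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions from an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)  # guard against empty reference

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 1 sub + 1 del over 6 words ≈ 0.33
```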
Approach
The authors use a voice cloning model to generate deepfake audio samples from a smaller labeled dataset, augmenting the training data for a speech-to-text model. They then fine-tune a pre-trained speech-to-text model on the augmented dataset and evaluate its performance by WER, as sketched below.
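A minimal sketch of that augmentation loop follows, assuming each labeled sample is paired with a synthesized rendition of the same transcript; clone_voice, fine_tune, and evaluate_wer are hypothetical placeholders, not the authors' code.

```python
# Sketch of the deepfake-based augmentation pipeline (hypothetical helpers;
# clone_voice/fine_tune/evaluate_wer stand in for real model and training calls).
from dataclasses import dataclass

@dataclass
class Sample:
    audio_path: str
    transcript: str

def augment(dataset: list[Sample], clone_voice) -> list[Sample]:
    """For each labeled sample, synthesize a deepfake rendition of the same
    transcript in the original speaker's cloned voice, growing the dataset."""
    augmented = list(dataset)
    for sample in dataset:
        fake_path = clone_voice(sample.audio_path, sample.transcript)
        augmented.append(Sample(fake_path, sample.transcript))
    return augmented

# train_set = augment(pure_set, clone_voice=sv2tts_clone)  # assumed SV2TTS wrapper
# model = fine_tune(pretrained_stt, train_set)             # assumed fine-tuning step
# score = evaluate_wer(model, test_set)                    # compare against a baseline
```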
Datasets
NPTEL (NPTEL2020 – Indian English Speech Dataset), specifically the "Pure-Set" subset of 1,000 manually transcribed audio clips.
Model(s)
Real-Time Voice Cloning (SV2TTS architecture: GE2E speaker encoder, Tacotron 2 synthesizer, WaveRNN vocoder); DeepSpeech (a recurrent-neural-network-based speech-to-text model).
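SV2TTS separates cloning into three stages: a GE2E speaker encoder maps a reference utterance to a fixed-size embedding, a Tacotron 2 synthesizer conditioned on that embedding generates a mel spectrogram for arbitrary text, and a WaveRNN vocoder renders the waveform. A hedged sketch of how those stages compose is below; the method names are illustrative, not the actual Real-Time Voice Cloning API.

```python
# Illustrative composition of the SV2TTS stages; encoder/synthesizer/vocoder
# objects and their methods are stand-ins, not the Real-Time Voice Cloning API.
import numpy as np

def clone_utterance(reference_wav: np.ndarray, text: str,
                    encoder, synthesizer, vocoder) -> np.ndarray:
    # 1. GE2E speaker encoder: reference audio -> fixed-size speaker embedding.
    speaker_embedding = encoder.embed(reference_wav)
    # 2. Tacotron 2 synthesizer: (text, embedding) -> mel spectrogram rendered
    #    in the reference speaker's voice.
    mel = synthesizer.synthesize(text, speaker_embedding)
    # 3. WaveRNN vocoder: mel spectrogram -> waveform samples.
    return vocoder.generate(mel)
```

Because the stages are decoupled, a single reference clip per speaker suffices to synthesize new utterances for any transcript, which is what makes the approach attractive for augmenting small labeled datasets.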
Author countries
Brazil