ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis

Authors: Hawau Olamide Toyin, Rufael Marew, Humaid Alblooshi, Samar M. Magdy, Hanan Aldarmaki

Published: 2025-05-26 20:15:15+00:00

AI Summary

ArVoice is a new multi-speaker Modern Standard Arabic (MSA) speech corpus with diacritized transcriptions, designed for multi-speaker speech synthesis and useful for tasks like deepfake detection. It comprises professionally recorded speech, a modified subset of the Arabic Speech Corpus, and synthetic speech, totaling 83.52 hours across 11 voices.

Abstract

We introduce ArVoice, a multi-speaker Modern Standard Arabic (MSA) speech corpus with diacritized transcriptions. It is intended for multi-speaker speech synthesis and can also be useful for other tasks such as speech-based diacritic restoration, voice conversion, and deepfake detection. ArVoice comprises: (1) a new professionally recorded set from six voice talents with diverse demographics; (2) a modified subset of the Arabic Speech Corpus; and (3) high-quality synthetic speech from two commercial systems. The complete corpus totals 83.52 hours of speech across 11 voices; around 10 hours are human speech from 7 speakers. We train three open-source TTS and two voice conversion systems to illustrate the use cases of the dataset. The corpus is available for research use.
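The figures in the abstract imply a split between human and synthetic speech that is not stated outright. The short sketch below derives it by simple subtraction. The function name `corpus_breakdown` is hypothetical, the human-speech figure is only approximate ("around 10 hours"), and the derived values are estimates rather than numbers reported by the authors.

```python
# Figures quoted in the abstract. The synthetic remainder is derived here,
# not reported by the authors, so treat it as an estimate.
TOTAL_HOURS = 83.52
HUMAN_HOURS_APPROX = 10.0  # "around 10 hours" in the abstract
TOTAL_VOICES = 11
HUMAN_SPEAKERS = 7         # six new voice talents plus, by inference, one ASC speaker


def corpus_breakdown(total_hours, human_hours, total_voices, human_speakers):
    """Return the implied synthetic share of the corpus (hypothetical helper)."""
    return {
        "synthetic_hours": round(total_hours - human_hours, 2),
        "synthetic_voices": total_voices - human_speakers,
        "human_share": round(human_hours / total_hours, 3),
    }


breakdown = corpus_breakdown(
    TOTAL_HOURS, HUMAN_HOURS_APPROX, TOTAL_VOICES, HUMAN_SPEAKERS
)
print(breakdown)
```

Under these assumptions, roughly seven eighths of the audio is synthetic, which is worth keeping in mind when reading results that rely on the full corpus.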


Key findings
Experiments show that diacritized transcripts improve TTS performance and that synthetic data augmentation improves the VITS model. The KNN-VC and AAS-VC voice conversion models achieve relatively high similarity scores. Among the TTS models, Fish-Speech exhibits low intelligibility.
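To compare models trained with and without diacritics, one needs an undiacritized copy of each transcript. The sketch below shows one common way to produce it by removing the Arabic short-vowel marks (tashkeel) with a regular expression. It is an illustration under stated assumptions, not the authors' preprocessing: the function names and the chosen Unicode range are assumptions made here.

```python
import re

# Arabic tashkeel: fathatan through sukun (U+064B-U+0652) plus the
# superscript alef (U+0670). This range is an assumption for illustration;
# a real pipeline may handle more marks.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove diacritic marks, leaving base letters and spacing untouched."""
    return _DIACRITICS.sub("", text)


def diacritic_ratio(text: str) -> float:
    """Diacritic marks per base letter, a rough check of how fully a
    transcript is diacritized. The letter range U+0621-U+064A is
    approximate (it also covers tatweel)."""
    marks = len(_DIACRITICS.findall(text))
    letters = sum(1 for ch in text if "\u0621" <= ch <= "\u064A")
    return marks / letters if letters else 0.0


diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, fully vowelled
print(strip_diacritics(diacritized))
print(diacritic_ratio(diacritized))
```

Keeping the diacritized and stripped versions side by side lets the same audio be paired with either transcript form when training or evaluating a TTS model.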
Approach
The paper introduces ArVoice, a new dataset for Arabic speech synthesis. It combines professionally recorded speech from six voice talents with a modified subset of the existing Arabic Speech Corpus and synthetic speech generated by two commercial TTS systems. The dataset is then used to train and evaluate three open-source TTS models and two voice conversion models.
Datasets
ArVoice, which combines professionally recorded speech from six voice talents, a modified subset of the Arabic Speech Corpus (ASC), and synthetic speech from two commercial systems; Arabic Speech Corpus (ASC); Tashkeela Corpus; Khaleej Corpus.
Model(s)
ArTST-tts, VITS, Fish-Speech (for TTS); AAS-VC, KNN-VC (for voice conversion); HiFi-GAN, ParallelWaveGAN (as vocoders); several open-source Arabic ASR models for evaluation.
Author countries
UAE