Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion

Authors: Jordan J. Bird, Ahmad Lotfi

Published: 2023-08-24 12:26:15+00:00

AI Summary

This paper introduces the DEEP-VOICE dataset for AI-generated speech detection and demonstrates that an Extreme Gradient Boosting model achieves 99.3% average accuracy in classifying real speech versus speech generated by Retrieval-based Voice Conversion, with a real-time inference cost of around 0.004 milliseconds per second of audio.

Abstract

There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation; there is therefore an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion. To address these emerging issues, the DEEP-VOICE dataset is generated in this study, comprising real human speech from eight well-known figures and their speech converted to one another using Retrieval-based Voice Conversion. Framed as a binary classification problem of whether speech is real or AI-generated, statistical analysis of temporal audio features through t-testing reveals significantly different distributions between the two classes. Hyperparameter optimisation is implemented for machine learning models to identify the source of speech. Following the training of 208 individual machine learning models over 10-fold cross-validation, it is found that the Extreme Gradient Boosting model achieves an average classification accuracy of 99.3% and can classify speech in real time, taking around 0.004 milliseconds to classify one second of speech. All data generated for this study is released publicly for future research on AI speech detection.
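To make the t-testing step concrete, the following is a minimal sketch (not the authors' code) of a two-sample t-test on one temporal audio feature. The choice of feature (mean frame-level RMS energy via librosa) and the file paths are placeholders; the paper compares the distributions of several extracted features.

```python
# Illustrative sketch: Welch's t-test on one temporal audio feature,
# comparing real and AI-generated clips. Feature choice and paths are
# assumptions for illustration only.
import librosa
import numpy as np
from scipy.stats import ttest_ind

def mean_rms(path: str) -> float:
    """Load a clip and return its mean frame-level RMS energy."""
    y, sr = librosa.load(path, sr=None)
    return float(np.mean(librosa.feature.rms(y=y)))

real_feats = [mean_rms(p) for p in ["real_clip_1.wav", "real_clip_2.wav"]]
fake_feats = [mean_rms(p) for p in ["fake_clip_1.wav", "fake_clip_2.wav"]]

# Welch's t-test: are the two feature distributions significantly different?
t_stat, p_value = ttest_ind(real_feats, fake_feats, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```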


Key findings
The Extreme Gradient Boosting model achieved the highest average accuracy (99.3%) and real-time performance, taking around 0.004 milliseconds of inference time per second of audio. Statistical analysis showed significantly different audio feature distributions between real and AI-generated speech. The dataset and findings are publicly released to promote further research.
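As an illustration of how the real-time claim could be checked, the sketch below times a fitted XGBoost classifier on a single per-second feature vector. This is not the paper's benchmark: the feature count (26) and the toy training data are assumptions made purely so the model can be timed.

```python
# Minimal timing sketch: mean prediction latency for the feature vector
# of one second of audio. Feature count and training data are assumed.
import time
import numpy as np
from xgboost import XGBClassifier

N_FEATURES = 26                      # assumed per-second feature count
rng = np.random.default_rng(0)

# Fit on toy data purely so a trained model exists to time.
model = XGBClassifier(n_estimators=100)
model.fit(rng.random((200, N_FEATURES)), rng.integers(0, 2, 200))

x = rng.random((1, N_FEATURES))      # one second of audio, featurised
repeats = 1000
start = time.perf_counter()
for _ in range(repeats):
    model.predict(x)
latency_ms = (time.perf_counter() - start) / repeats * 1000.0
print(f"Mean inference latency: {latency_ms:.4f} ms per 1 s of audio")
```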
Approach
The authors created the DEEP-VOICE dataset using real speech from eight public figures and AI-generated counterparts produced via Retrieval-based Voice Conversion. They then trained 208 machine learning models over 10-fold cross-validation on extracted temporal audio features, optimising hyperparameters to maximise accuracy while keeping inference fast enough for real-time use. The best-performing model was Extreme Gradient Boosting.
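A minimal sketch of this evaluation protocol follows, assuming a flattened feature table with a LABEL column. The CSV path, column names, and hyperparameter values are hypothetical; the paper's exact features and tuned settings may differ.

```python
# Illustrative sketch: XGBoost evaluated with 10-fold cross-validation
# on a table of extracted audio features. Paths/columns are assumptions.
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

df = pd.read_csv("deep_voice_features.csv")          # hypothetical path
X = df.drop(columns=["LABEL"]).to_numpy()            # extracted audio features
y = (df["LABEL"] == "FAKE").astype(int).to_numpy()   # 1 = AI-generated

model = XGBClassifier(n_estimators=300, eval_metric="logloss")
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Mean 10-fold accuracy: {scores.mean():.3f}")
```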
Datasets
DEEP-VOICE dataset (created by the authors, containing real and AI-generated speech from eight public figures)
Model(s)
Extreme Gradient Boosting (XGBoost), Random Forests, Quadratic and Linear Discriminant Analyses, Ridge Regression, Gaussian and Bernoulli Naive Bayes, K-Nearest Neighbors, Support Vector Machines, Stochastic Gradient Descent, and Gaussian Process.
Author countries
UK