Defense Against Synthetic Speech: Real-Time Detection of RVC Voice Conversion Attacks

Authors: Prajwal Chinchmalatpure, Suyash Chinchmalatpure, Siddharth Chavan

Published: 2025-12-31 02:06:42+00:00

Journal Ref: IJRAR Int. J. Res. Anal. Rev., vol. 12, no. 4, pp. 102-109, 2025

AI Summary

This study focuses on real-time detection of AI-generated speech produced with Retrieval-based Voice Conversion (RVC), a capability crucial for mitigating impersonation and fraud. The researchers propose a streaming classification approach that segments audio into one-second windows, extracts acoustic features, and trains supervised machine learning models to classify each segment as real or voice-converted. The method supports low-latency inference and demonstrates the feasibility of practical, real-time deepfake speech detection under realistic audio mixing conditions.

Abstract

Generative audio technologies now enable highly realistic voice cloning and real-time voice conversion, increasing the risk of impersonation, fraud, and misinformation in communication channels such as phone and video calls. This study investigates real-time detection of AI-generated speech produced using Retrieval-based Voice Conversion (RVC), evaluated on the DEEP-VOICE dataset, which includes authentic and voice-converted speech samples from multiple well-known speakers. To simulate realistic conditions, deepfake generation is applied to isolated vocal components, followed by the reintroduction of background ambiance to suppress trivial artifacts and emphasize conversion-specific cues. We frame detection as a streaming classification task by dividing audio into one-second segments, extracting time-frequency and cepstral features, and training supervised machine learning models to classify each segment as real or voice-converted. The proposed system enables low-latency inference, supporting both segment-level decisions and call-level aggregation. Experimental results show that short-window acoustic features can reliably capture discriminative patterns associated with RVC speech, even in noisy backgrounds. These findings demonstrate the feasibility of practical, real-time deepfake speech detection and underscore the importance of evaluating under realistic audio mixing conditions for robust deployment.
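As a concrete illustration of the segment-level and call-level decision modes mentioned in the abstract, the sketch below averages per-window probabilities of voice conversion into a single call verdict. The mean-probability rule and the 0.5 threshold are assumptions for illustration; the paper does not specify its aggregation function.

```python
import numpy as np

def aggregate_call(segment_probs, threshold=0.5):
    """Combine per-segment P(voice-converted) scores into one call-level verdict.

    segment_probs: 1-D sequence of probabilities, one per one-second window.
    NOTE: mean aggregation and the 0.5 threshold are illustrative assumptions,
    not the paper's specified rule.
    """
    segment_probs = np.asarray(segment_probs, dtype=float)
    segment_flags = segment_probs >= threshold      # per-segment (streaming) decisions
    call_score = segment_probs.mean()               # soft call-level score
    return {
        "segment_decisions": segment_flags,
        "call_score": call_score,
        "call_is_converted": bool(call_score >= threshold),
    }

# Example: a five-second call whose later windows look suspicious
print(aggregate_call([0.2, 0.7, 0.9, 0.8, 0.6]))
```

The same structure supports both uses described in the abstract: the per-segment flags can drive low-latency alerts during a live call, while the aggregated score yields a final call-level decision.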


Key findings
Short-window acoustic features can reliably capture discriminative patterns associated with RVC speech, even against noisy backgrounds, indicating that practical, real-time detection is feasible. Training showed steady accuracy improvement and loss reduction, suggesting that the feature representation effectively separates converted from genuine speech. The study emphasizes evaluating under realistic audio mixing conditions (reintroducing ambient noise after conversion) so that detectors learn conversion-specific artifacts rather than trivial ones, which is essential for robust deployment.
Approach
The approach frames detection as a streaming classification task, segmenting audio into one-second windows. Time-frequency and cepstral features are extracted from each segment, and supervised machine learning models are trained to classify them as real or RVC voice-converted. The deepfake generation process explicitly reintroduces background ambiance to ensure models learn conversion-specific cues rather than trivial artifacts.
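A minimal Python sketch of this windowing and feature-extraction step follows, assuming librosa for audio loading and feature computation. The specific feature set (20 MFCCs plus spectral-centroid statistics) is an illustrative choice; the paper states only that time-frequency and cepstral features are extracted from each one-second segment.

```python
import numpy as np
import librosa

def window_features(path, sr=16000, win_seconds=1.0):
    """Split a recording into one-second windows and return one feature
    vector per window, mirroring the streaming setup described above.

    NOTE: the feature set below (20 MFCCs + spectral centroid, summarized
    by mean/std across frames) is an assumption for illustration.
    """
    y, sr = librosa.load(path, sr=sr, mono=True)
    hop = int(win_seconds * sr)
    features = []
    for start in range(0, len(y) - hop + 1, hop):
        seg = y[start:start + hop]                         # one-second window
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=20)
        centroid = librosa.feature.spectral_centroid(y=seg, sr=sr)
        # Summarize frame-level features with per-coefficient mean and std
        vec = np.concatenate([
            mfcc.mean(axis=1), mfcc.std(axis=1),
            centroid.mean(axis=1), centroid.std(axis=1),
        ])
        features.append(vec)
    return np.stack(features)                              # (n_windows, n_dims)
```

Because each window is processed independently, feature vectors can be computed as audio arrives, which is what enables the low-latency, segment-by-segment classification.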
Datasets
DEEP-VOICE dataset
Model(s)
Supervised machine learning models including logistic regression, support vector machines, tree-based ensembles (random forests and gradient boosting), k-nearest neighbors, and a feed-forward neural network.
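For orientation, a scikit-learn sketch of training and cross-validating the listed model families on per-window features is shown below. All hyperparameters are library defaults or illustrative values, not the paper's settings, and the feature scaling step is an assumption.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# X: per-window feature matrix; y: 0 = real, 1 = voice-converted.
# Hyperparameters are defaults/illustrative, not the paper's settings.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(probability=True),
    "random_forest": RandomForestClassifier(n_estimators=300),
    "gradient_boosting": GradientBoostingClassifier(),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "mlp": MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500),
}

def evaluate(X, y):
    """Report 5-fold cross-validated accuracy for each candidate model."""
    for name, clf in models.items():
        pipe = make_pipeline(StandardScaler(), clf)   # scale features first
        scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```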
Author countries
United States, Canada, India