All-for-One and One-For-All: Deep learning-based feature fusion for Synthetic Speech Detection

Authors: Daniele Mari, Davide Salvi, Paolo Bestagini, Simone Milani

Published: 2023-07-28 13:50:25+00:00

AI Summary

This paper proposes a deep learning-based synthetic speech detection model that fuses three different feature sets (FD, STLT, and bicoherence features) for improved performance. The fused model outperforms state-of-the-art solutions and demonstrates robustness to anti-forensic attacks.

Abstract

Recent advances in deep learning and computer vision have made the synthesis and counterfeiting of multimedia content more accessible than ever, leading to possible threats and dangers from malicious users. In the audio field, we are witnessing the growth of speech deepfake generation techniques, which solicit the development of synthetic speech detection algorithms to counter possible mischievous uses such as frauds or identity thefts. In this paper, we consider three different feature sets proposed in the literature for the synthetic speech detection task and present a model that fuses them, achieving overall better performances with respect to the state-of-the-art solutions. The system was tested on different scenarios and datasets to prove its robustness to anti-forensic attacks and its generalization capabilities.


Key findings
The fused model significantly outperforms individual models using only one feature set. The model shows good generalization capabilities on unseen datasets. The model is relatively robust to MP3 compression but less so to high levels of Gaussian noise.
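The robustness tests above involve MP3 re-compression and additive Gaussian noise. Below is a minimal, hypothetical sketch of how such anti-forensic perturbations can be simulated; the paper does not specify its tooling, so the use of numpy, ffmpeg, the SNR parameterization, and the bitrate are assumptions for illustration only.

```python
# Hypothetical sketch: simulating the perturbations discussed in Key findings
# (additive white Gaussian noise at a target SNR, MP3 round-trip via ffmpeg).
# Tooling and parameters are assumptions, not the authors' pipeline.
import subprocess
import numpy as np

def add_gaussian_noise(x: np.ndarray, snr_db: float) -> np.ndarray:
    """Add white Gaussian noise so the result has roughly the given SNR (dB)."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=x.shape)
    return x + noise

def mp3_roundtrip(in_wav: str, out_wav: str, bitrate: str = "128k") -> None:
    """Re-encode a WAV file through MP3 and back to test compression robustness."""
    subprocess.run(["ffmpeg", "-y", "-i", in_wav, "-b:a", bitrate, "tmp.mp3"], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", "tmp.mp3", out_wav], check=True)
```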
Approach
The approach uses three audio feature sets proposed in prior work (FD, STLT, and bicoherence features), each capturing a different aspect of the speech signal. The features are fed into separate fully connected networks to produce embeddings, which are concatenated and passed to a final fully connected network for classification.
Datasets
ASVspoof 2019, LJSpeech, LibriSpeech, Cloud2019, VidTIMIT
Model(s)
Multiple fully connected (FC) neural networks: one per feature set, plus a final FC network for fusion and classification. LeakyReLU activations and a Softmax output layer are used.
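A minimal PyTorch sketch of the described fusion architecture is given below. The overall structure (one FC branch per feature set, concatenation of embeddings, final FC classifier, LeakyReLU, Softmax) follows the paper; the feature dimensions, hidden sizes, and embedding size are assumptions.

```python
# Minimal sketch of the fusion architecture, assuming PyTorch and placeholder sizes.
import torch
import torch.nn as nn

class Branch(nn.Module):
    """FC network mapping one feature set to a fixed-size embedding."""
    def __init__(self, in_dim: int, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.LeakyReLU(),
            nn.Linear(128, emb_dim), nn.LeakyReLU(),
        )

    def forward(self, x):
        return self.net(x)

class FusionDetector(nn.Module):
    """Concatenates the three embeddings and classifies real vs. synthetic speech."""
    def __init__(self, fd_dim: int, stlt_dim: int, bic_dim: int, emb_dim: int = 64):
        super().__init__()
        self.fd = Branch(fd_dim, emb_dim)      # FD feature branch
        self.stlt = Branch(stlt_dim, emb_dim)  # STLT feature branch
        self.bic = Branch(bic_dim, emb_dim)    # bicoherence feature branch
        self.classifier = nn.Sequential(
            nn.Linear(3 * emb_dim, 64), nn.LeakyReLU(),
            nn.Linear(64, 2), nn.Softmax(dim=-1),
        )

    def forward(self, fd, stlt, bic):
        z = torch.cat([self.fd(fd), self.stlt(stlt), self.bic(bic)], dim=-1)
        return self.classifier(z)
```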
Author countries
Italy