The Sound of Silence: Efficiency of First Digit Features in Synthetic Audio Detection

Authors: Daniele Mari, Federica Latora, Simone Milani

Published: 2022-10-06 08:31:21+00:00

Comment: Accepted at WIFS 2022

AI Summary

This paper investigates the discriminative role of silenced parts in synthetic speech detection, proposing a computationally lightweight and robust method. It leverages first digit statistics extracted from MFCC coefficients to identify irregularities in these silent segments. The approach achieves over 90% accuracy on most ASVSpoof dataset classes, outperforming some state-of-the-art methods in open-set scenarios.

Abstract

The recent integration of generative neural strategies and audio processing techniques has fostered the widespread adoption of synthetic speech synthesis and transformation algorithms. This capability proves to be harmful in many legal and informative processes (news, biometric authentication, audio evidence in courts, etc.). Thus, the development of efficient detection algorithms is both crucial and challenging due to the heterogeneity of forgery techniques. This work investigates the discriminative role of silenced parts in synthetic speech detection and shows how first digit statistics extracted from MFCC coefficients can efficiently enable a robust detection. The proposed procedure is computationally lightweight and effective on many different algorithms since it does not rely on large neural detection architectures, and it obtains an accuracy above 90% in most of the classes of the ASVSpoof dataset.


Key findings
The study found that silenced parts within speech contain most of the discriminative information for synthetic audio detection, with performance on silent sections comparable to or better than using the full audio. Removing silent parts significantly reduces detection accuracy. The proposed method, based on first digit statistics and a Random Forest, is computationally lightweight and robust, showing better performance against unseen algorithms in open-set evaluation compared to some state-of-the-art approaches.
Approach
The authors propose extracting First Digit (FD) statistics from quantized MFCC coefficients, particularly focusing on silent parts of audio signals. These FD statistics are compared to Benford's law using various divergence measures (Shannon, Renyi, Tsallis, MSE), generating a set of features. A Random Forest classifier then uses these features to distinguish between bona fide and synthetic audio.
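The feature pipeline above can be sketched as follows. This is a hypothetical illustration, not the authors' exact implementation: the quantization step, the Renyi/Tsallis order `alpha`, and the small `eps` guards are assumptions for the sake of a runnable example.

```python
import numpy as np

def benford_pmf():
    """Benford's law: P(d) = log10(1 + 1/d) for first digits d = 1..9."""
    d = np.arange(1, 10)
    return np.log10(1.0 + 1.0 / d)

def first_digit_hist(coeffs, eps=1e-12):
    """Empirical first-digit distribution of (quantized) MFCC magnitudes."""
    mags = np.abs(np.asarray(coeffs, dtype=float)).ravel()
    mags = mags[mags > eps]                               # drop exact zeros
    first = (mags / 10.0 ** np.floor(np.log10(mags))).astype(int)  # leading digit
    hist = np.bincount(first, minlength=10)[1:10].astype(float)
    return hist / hist.sum()

def divergence_features(p_emp, p_ben, alpha=2.0, eps=1e-12):
    """Shannon (KL), Renyi, Tsallis divergences and MSE w.r.t. Benford's law."""
    kl = np.sum(p_emp * np.log(p_emp / p_ben + eps))
    renyi = np.log(np.sum(p_emp ** alpha * p_ben ** (1.0 - alpha))) / (alpha - 1.0)
    tsallis = (1.0 - np.sum(p_emp ** alpha * p_ben ** (1.0 - alpha))) / (alpha - 1.0)
    mse = np.mean((p_emp - p_ben) ** 2)
    return np.array([kl, renyi, tsallis, mse])
```

A feature vector for one audio segment would then be `divergence_features(first_digit_hist(mfcc_of_silence), benford_pmf())`, computed per MFCC coefficient and concatenated before being passed to the classifier.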
Datasets
ASVSpoof
Model(s)
Random Forest classifier
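A minimal sketch of the classification stage, assuming divergence features have already been extracted per segment. The synthetic feature matrix, label rule, and hyperparameters here are placeholders, not the paper's configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder FD-divergence features: 200 segments x 4 divergences each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Placeholder labels (0 = bona fide, 1 = synthetic), loosely tied to feature 0.
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:150], y[:150])          # train on the first 150 segments
acc = clf.score(X[150:], y[150:])  # accuracy on the held-out 50
```

Because a Random Forest on a handful of divergence features is small and fast, this stage keeps the detector lightweight compared to large neural architectures.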
Author countries
Italy