The Sound of Silence: Efficiency of First Digit Features in Synthetic Audio Detection

Authors: Daniele Mari, Federica Latora, Simone Milani

Published: 2022-10-06 08:31:21+00:00

Comment: Accepted at WIFS 2022

AI Summary

This paper investigates the discriminative role of silenced parts in synthetic speech detection, proposing a computationally lightweight and robust method. It leverages first digit statistics extracted from MFCC coefficients to identify irregularities in these silent segments. The approach achieves over 90% accuracy on most ASVSpoof dataset classes, outperforming some state-of-the-art methods in open-set scenarios.

Abstract

The recent integration of generative neural strategies and audio processing techniques has fostered the widespread adoption of synthetic speech synthesis and transformation algorithms. This capability proves to be harmful in many legal and informative processes (news, biometric authentication, audio evidence in courts, etc.). Thus, the development of efficient detection algorithms is both crucial and challenging due to the heterogeneity of forgery techniques. This work investigates the discriminative role of silenced parts in synthetic speech detection and shows how first digit statistics extracted from MFCC coefficients can efficiently enable a robust detection. The proposed procedure is computationally lightweight and effective on many different algorithms since it does not rely on large neural detection architectures, and it obtains an accuracy above 90% in most of the classes of the ASVSpoof dataset.


Key findings
The study found that silenced parts within speech contain most of the discriminative information for synthetic audio detection, with performance on silent sections comparable to or better than using the full audio. Removing silent parts significantly reduces detection accuracy. The proposed method, based on first digit statistics and a Random Forest, is computationally lightweight and robust, showing better performance against unseen algorithms in open-set evaluation compared to some state-of-the-art approaches.
Approach
The authors propose extracting First Digit (FD) statistics from quantized MFCC coefficients, particularly focusing on silent parts of audio signals. These FD statistics are compared to Benford's law using various divergence measures (Shannon, Renyi, Tsallis, MSE), generating a set of features. A Random Forest classifier then uses these features to distinguish between bona fide and synthetic audio.
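The feature pipeline above can be sketched as follows. This is a hypothetical illustration, not the authors' exact implementation: the quantization step, the Renyi/Tsallis order `alpha`, and the small `eps` guards are assumptions for the sake of a runnable example.

```python
import numpy as np

def benford_pmf():
    """Benford's law: P(d) = log10(1 + 1/d) for first digits d = 1..9."""
    d = np.arange(1, 10)
    return np.log10(1.0 + 1.0 / d)

def first_digit_hist(coeffs, eps=1e-12):
    """Empirical first-digit distribution of (quantized) MFCC magnitudes."""
    mags = np.abs(np.asarray(coeffs, dtype=float)).ravel()
    mags = mags[mags > eps]                               # drop exact zeros
    first = (mags / 10.0 ** np.floor(np.log10(mags))).astype(int)  # leading digit
    hist = np.bincount(first, minlength=10)[1:10].astype(float)
    return hist / hist.sum()

def divergence_features(p_emp, p_ben, alpha=2.0, eps=1e-12):
    """Shannon (KL), Renyi, Tsallis divergences and MSE w.r.t. Benford's law."""
    kl = np.sum(p_emp * np.log(p_emp / p_ben + eps))
    renyi = np.log(np.sum(p_emp ** alpha * p_ben ** (1.0 - alpha))) / (alpha - 1.0)
    tsallis = (1.0 - np.sum(p_emp ** alpha * p_ben ** (1.0 - alpha))) / (alpha - 1.0)
    mse = np.mean((p_emp - p_ben) ** 2)
    return np.array([kl, renyi, tsallis, mse])
```

A feature vector for one audio segment would then be `divergence_features(first_digit_hist(mfcc_of_silence), benford_pmf())`, computed per MFCC coefficient and concatenated before being passed to the classifier.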
Datasets
ASVSpoof
Model(s)
Random Forest classifier
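A minimal sketch of the classification stage, assuming divergence features have already been extracted per segment. The synthetic feature matrix, label rule, and hyperparameters here are placeholders, not the paper's configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder FD-divergence features: 200 segments x 4 divergences each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Placeholder labels (0 = bona fide, 1 = synthetic), loosely tied to feature 0.
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:150], y[:150])          # train on the first 150 segments
acc = clf.score(X[150:], y[150:])  # accuracy on the held-out 50
```

Because a Random Forest on a handful of divergence features is small and fast, this stage keeps the detector lightweight compared to large neural architectures.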
Author countries
Italy