Easy, Interpretable, Effective: openSMILE for voice deepfake detection

Authors: Octavian Pascu, Dan Oneata, Horia Cucu, Nicolas M. Müller

Published: 2024-08-28 13:14:18+00:00

AI Summary

This paper demonstrates that attacks in the ASVspoof5 dataset can be detected with surprising accuracy using a small subset of simple, interpretable features extracted with the openSMILE library. Individual features, such as the mean unvoiced segment length, achieve per-attack equal error rates (EERs) as low as 0.8%, with an overall EER of 15.7 ± 6.0%.

Abstract

In this paper, we demonstrate that attacks in the latest ASVspoof5 dataset -- a de facto standard in the field of voice authenticity and deepfake detection -- can be identified with surprising accuracy using a small subset of very simplistic features. These are derived from the openSMILE library, and are scalar-valued, easy to compute, and human interpretable. For example, attack A10's unvoiced segments have a mean length of 0.09 ± 0.02, while bona fide instances have a mean length of 0.18 ± 0.07. Using this feature alone, a threshold classifier achieves an Equal Error Rate (EER) of 10.3% for attack A10. Similarly, across all attacks, we achieve up to 0.8% EER, with an overall EER of 15.7 ± 6.0%. We explore the generalization capabilities of these features and find that some of them transfer effectively between attacks, primarily when the attacks originate from similar Text-to-Speech (TTS) architectures. This finding may indicate that voice anti-spoofing is, in part, a problem of identifying and remembering signatures or fingerprints of individual TTS systems. This allows for a better understanding of anti-spoofing models and the challenges they face in real-world applications.
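To illustrate the threshold-classifier idea, the sketch below (not the authors' code; feature values and labels are made-up placeholders) scores each utterance by a single scalar feature such as the mean unvoiced segment length and computes the EER with scikit-learn.

```python
# Minimal sketch of a single-feature threshold classifier with EER scoring.
# Values are synthetic placeholders, not data from the paper.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where false-positive and false-negative rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2.0

# labels: 1 = bona fide, 0 = spoofed (e.g. attack A10)
labels = np.array([1, 1, 1, 0, 0, 0])
# The raw feature value serves directly as the detection score: bona fide speech
# tends to have longer unvoiced segments (~0.18) than A10 (~0.09) per the abstract.
unvoiced_mean_len = np.array([0.21, 0.17, 0.15, 0.10, 0.08, 0.11])

print(f"EER = {equal_error_rate(labels, unvoiced_mean_len):.3f}")
```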


Key findings
Single openSMILE features achieve surprisingly low EERs for in-domain detection, and some features generalize well across attacks that originate from similar TTS architectures; generalization across different TTS architectures is significantly more challenging. While a Wav2Vec2-based detector performs better overall, the openSMILE features offer superior interpretability. A sketch of the cross-attack evaluation is given below.
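A rough sketch of such a cross-attack analysis, assuming precomputed per-utterance values of one scalar feature for each attack (the numbers below are synthetic placeholders, and balanced error is used as a simple stand-in for the EER):

```python
# Cross-attack generalization sketch: pick a threshold on one attack,
# then measure how well it separates bona fide speech from other attacks.
import numpy as np

def best_threshold(bonafide, spoof):
    """Grid-search the threshold minimizing balanced error on one attack."""
    candidates = np.linspace(min(spoof.min(), bonafide.min()),
                             max(spoof.max(), bonafide.max()), 200)
    errors = [0.5 * ((spoof >= t).mean() + (bonafide < t).mean())
              for t in candidates]
    return candidates[int(np.argmin(errors))]

np.random.seed(0)
# Placeholder feature values (e.g. mean unvoiced segment length) per condition.
bonafide = np.random.normal(0.18, 0.07, 100)
attacks = {"A10": np.random.normal(0.09, 0.02, 100),
           "A11": np.random.normal(0.10, 0.02, 100)}

for train_attack, train_vals in attacks.items():
    t = best_threshold(bonafide, train_vals)
    for test_attack, test_vals in attacks.items():
        err = 0.5 * ((test_vals >= t).mean() + (bonafide < t).mean())
        print(f"train {train_attack} -> test {test_attack}: balanced error {err:.2f}")
```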
Approach
The authors use openSMILE's eGeMAPSv02 feature set to extract scalar-valued, interpretable audio features. They then employ a simple threshold classifier or a linear regression model trained on these features to distinguish bona fide from spoofed speech, and analyze the performance of individual features as well as their generalizability across attacks (i.e., different TTS systems), as sketched below.
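A minimal sketch of this pipeline, assuming the opensmile and scikit-learn Python packages; the file names, labels, and the use of LinearRegression on 0/1 targets as the "linear model" are illustrative assumptions, not the authors' released code.

```python
# Extract eGeMAPSv02 functionals per utterance and fit a simple linear model.
import numpy as np
import pandas as pd
import opensmile
from sklearn.linear_model import LinearRegression

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Hypothetical training files with labels: 1 = bona fide, 0 = spoofed.
train_files = ["bonafide_001.wav", "bonafide_002.wav", "spoof_A10_001.wav"]
train_labels = np.array([1, 1, 0])

# One row of scalar functionals per utterance (88 features for eGeMAPSv02).
X_train = pd.concat([smile.process_file(f) for f in train_files]).values

# Linear regression on 0/1 targets; its continuous output is used as a
# detection score and thresholded, as with the single-feature classifier.
model = LinearRegression().fit(X_train, train_labels)
scores = model.predict(X_train)
print(scores)
```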
Datasets
ASVspoof5 dataset
Model(s)
Threshold classifier and linear regression model. Wav2Vec2.0 is used for comparison but is not the primary model in the proposed approach.
Author countries
Romania, Germany