Towards robust audio spoofing detection: a detailed comparison of traditional and learned features

Authors: Balamurali BT, Kin Wah Edward Lin, Simon Lui, Jer-Ming Chen, Dorien Herremans

Published: 2019-05-28 06:51:18+00:00

Journal Ref: IEEE Access. 2019

AI Summary

This research investigates robust audio features for detecting replay spoofing attacks against automatic speaker verification systems, aiming to overcome the limitation of existing systems that depend on knowing the spoofing technique. The authors compare traditional audio features with those learned through an autoencoder and propose a hybrid system that combines both types of features. This approach provides a detailed methodology for setting up state-of-the-art audio feature detection, preprocessing, and postprocessing, evaluated on the ASVspoof 2017 dataset.

Abstract

Automatic speaker verification, like every other biometric system, is vulnerable to spoofing attacks. Using only a few minutes of recorded voice of a genuine client of a speaker verification system, attackers can develop a variety of spoofing attacks that might trick such systems. Detecting these attacks using the audio cues present in the recordings is an important challenge. Most existing spoofing detection systems depend on knowing the spoofing technique used. With this research, we aim to overcome this limitation by examining robust audio features, both traditional and those learned through an autoencoder, that are generalizable over different types of replay spoofing. Furthermore, we provide a detailed account of all the steps necessary in setting up state-of-the-art audio feature detection, preprocessing, and postprocessing, such that the (non-audio expert) machine learning researcher can implement such systems. Finally, we evaluate the performance of our robust replay speaker detection system with a wide variety, and different combinations, of both extracted and machine learned audio features on the 'out in the wild' ASVspoof 2017 dataset. This dataset contains a variety of new spoofing configurations. Since our focus is on examining which features will ensure robustness, we base our system on a traditional Gaussian Mixture Model-Universal Background Model. We then systematically investigate the relative contribution of each feature set. The fused models, based on the known audio features and on the machine learned features respectively, have comparable performance, with an Equal Error Rate (EER) of 12%. The final best performing model, which obtains an EER of 10.8%, is a hybrid model that contains both known and machine learned features, thus revealing the importance of incorporating both types of features when developing a robust spoofing prediction model.


Key findings
The hybrid system, which incorporates both traditional and machine-learned features and is trained on an augmented dataset, achieved the best performance with an Equal Error Rate (EER) of 10.8%. Models based solely on fused known audio features or solely on fused autoencoder-learned features showed comparable performance, at around 12% EER. Constant Q Cepstral Coefficients (CQCCs) were the best-performing individual traditional feature set, while Mel-Frequency Cepstral Coefficients (MFCCs) performed the worst.
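
For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoofed recordings accepted as genuine) equals the false rejection rate (genuine recordings rejected). The following is a minimal sketch of how it could be estimated, assuming that a higher score means "more likely genuine". The function name `equal_error_rate` and the toy scores are illustrative assumptions and are not taken from the paper.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption: higher score = more likely genuine. Returns the mean of the
    false acceptance and false rejection rates at the threshold where the two
    are closest (exact only if they coincide at an observed score).
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # A trial is accepted as genuine when its score is >= t.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical, for illustration only).
genuine = [2.1, 1.7, 0.9, 0.4, 1.2]
spoof = [-1.5, -0.3, 0.6, -0.8, 1.0]
eer = equal_error_rate(genuine, spoof)  # 0.2 for these toy scores
```

On this scale, the reported 10.8% corresponds to a value of 0.108: at the chosen threshold, roughly one in nine trials of each kind is misclassified.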
Approach
The proposed system uses a Gaussian Mixture Model-Universal Background Model (GMM-UBM) and compares the effectiveness of 11 traditional audio feature sets with that of features learned via an autoencoder. The autoencoder also augments the training dataset by reconstructing genuine and spoofed recordings. A logistic regression fusion model combines the likelihoods from individual GMM-UBM models trained on these diverse feature sets to produce a final prediction.
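
The fusion step can be illustrated with a small sketch. It assumes that each feature-set model has already produced one score per recording (for example a log-likelihood ratio, which is an assumption here), and it fits the fusion weights with plain batch gradient descent. The summary does not say how the authors trained their fusion model, so the function names, hyperparameters, and toy scores below are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_fusion(scores, labels, lr=0.1, epochs=2000):
    """Fit logistic-regression fusion weights by batch gradient descent.

    scores: one score vector per recording, one entry per feature-set model
    labels: 1 for genuine, 0 for spoofed
    """
    n_models = len(scores[0])
    w, b = [0.0] * n_models, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_models, 0.0
        for x, y in zip(scores, labels):
            # Gradient of the log-loss with respect to the linear output.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(n_models):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(scores) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(scores)
    return w, b


def fuse(x, w, b):
    """Return the fused probability that a recording is genuine."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical scores from two feature-set models for six recordings.
train_scores = [[1.2, 0.3], [0.8, -0.1], [1.5, 0.6],
                [-0.9, 0.2], [-1.1, -0.4], [-0.6, 0.1]]
train_labels = [1, 1, 1, 0, 0, 0]
w, b = train_fusion(train_scores, train_labels)
p_genuine = fuse([1.0, 0.2], w, b)
```

The learned weights indicate how much each feature set contributes to the final decision, which is one way to read the relative contribution of the individual models.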
Datasets
ASVspoof 2017 dataset (protocol V2)
Model(s)
Gaussian Mixture Model-Universal Background Model (GMM-UBM), Autoencoder (feedforward neural network), Logistic Regression Fusion
Author countries
Singapore, China