Generalized Spoofing Detection Inspired from Audio Generation Artifacts

Authors: Yang Gao, Tyler Vuong, Mahsa Elyasi, Gaurav Bharaj, Rita Singh

Published: 2021-04-08 23:02:56+00:00

AI Summary

This paper proposes using 2D Discrete Cosine Transform (DCT) on log-Mel spectrograms as a novel long-range spectro-temporal feature for audio deepfake detection. This feature effectively captures artifacts in generated audio, outperforming existing features like log-Mel spectrograms, CQCC, and MFCC, and leading to state-of-the-art performance on the ASVspoof 2019 challenge.

Abstract

State-of-the-art methods for audio generation suffer from fingerprint artifacts and repeated inconsistencies across the temporal and spectral domains. Such artifacts can be well captured by frequency-domain analysis of the spectrogram. Thus, we propose a novel use of a long-range spectro-temporal modulation feature -- the 2D DCT over the log-Mel spectrogram -- for audio deepfake detection. We show that this feature works better than the log-Mel spectrogram, CQCC, and MFCC as a candidate for capturing such artifacts. Alongside this new feature, we employ spectrum augmentation and feature normalization to reduce overfitting and bridge the gap between the training and test datasets. We developed a CNN-based baseline that achieved a 0.0849 t-DCF and outperformed the top single systems previously reported in the ASVspoof 2019 challenge. Finally, by combining our baseline with the proposed 2D DCT spectro-temporal feature, we decrease the t-DCF score by 14% to 0.0737, making it a state-of-the-art system for spoofing detection. Furthermore, we evaluate our model on two external datasets, demonstrating the proposed feature's generalization ability. We also provide analysis and ablation studies for the proposed feature and results.


Key findings
The proposed 2D DCT feature improved the t-DCF score by 14% (from 0.0849 to 0.0737) over the baseline, achieving state-of-the-art performance on ASVspoof 2019. The feature also generalized well to external datasets and proved effective in speaker verification tasks.
Approach
The authors propose a new feature, 2D DCT on log-Mel spectrograms, to capture artifacts present in generated audio. This feature is used with a CNN-based baseline model, and SpecAugment and feature normalization are applied to improve robustness and generalization. The final system combines this new feature with the baseline model.
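The core feature extraction can be sketched as applying a type-II DCT along both the mel-frequency and time axes of a log-Mel spectrogram, yielding a spectro-temporal modulation representation. The sketch below is illustrative only, assuming a precomputed log-Mel spectrogram; the function name and the orthonormal DCT normalization are my choices, not details confirmed by the paper.

```python
import numpy as np
from scipy.fft import dct

def spectro_temporal_2d_dct(log_mel: np.ndarray) -> np.ndarray:
    """Illustrative 2D DCT over a log-Mel spectrogram.

    log_mel: array of shape (n_mels, n_frames).
    Returns an array of the same shape whose axes index spectral and
    temporal modulation frequencies, respectively.
    """
    # Type-II DCT (orthonormal) along the mel-frequency axis...
    coeffs = dct(log_mel, type=2, norm="ortho", axis=0)
    # ...then along the time axis, giving the separable 2D DCT.
    coeffs = dct(coeffs, type=2, norm="ortho", axis=1)
    return coeffs

# Toy input standing in for a real log-Mel spectrogram
# (80 mel bands x 200 frames); values are random, for shape checking only.
rng = np.random.default_rng(0)
log_mel = np.log(rng.random((80, 200)) + 1e-6)
feat = spectro_temporal_2d_dct(log_mel)
print(feat.shape)  # (80, 200)
```

In practice the resulting coefficient map would be fed to the CNN in place of (or alongside) the raw log-Mel input; low-order coefficients summarize slow spectro-temporal modulations, while repeated generation artifacts show up as energy at higher modulation frequencies.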
Datasets
ASVspoof 2019 challenge dataset, FoR dataset, RTVCspoof dataset
Model(s)
CNN-based model with residual blocks and bidirectional GRUs
Author countries
USA