Deep Spectro-temporal Artifacts for Detecting Synthesized Speech

Authors: Xiaohui Liu, Meng Liu, Lin Zhang, Linjuan Zhang, Chang Zeng, Kai Li, Nan Li, Kong Aik Lee, Longbiao Wang, Jianwu Dang

Published: 2022-10-11 08:31:30+00:00

AI Summary

This paper presents a system for detecting synthesized speech in two tracks of the Audio Deep Synthesis Detection (ADD) Challenge: Low-quality Fake Audio Detection and Partially Fake Audio Detection. The approach detects spectro-temporal artifacts using raw waveforms, handcrafted spectral features, and deep embeddings, and incorporates data augmentation, domain adaptation via fine-tuning, and a greedy fusion strategy.
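
As a concrete illustration of the low-quality data augmentation mentioned above, the sketch below mixes noise into a clip at a target signal-to-noise ratio; the noise source, SNR range, and function names are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target SNR (dB) to simulate low-quality audio."""
    # Tile or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: augment a clip at a random SNR between 5 and 20 dB (illustrative range).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000).astype(np.float32)   # placeholder 1 s of 16 kHz audio
noise = rng.standard_normal(16000).astype(np.float32)   # placeholder noise recording
augmented = add_noise_at_snr(clean, noise, snr_db=rng.uniform(5, 20))
```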

Abstract

The Audio Deep Synthesis Detection (ADD) Challenge has been held to detect generated human-like speech. Based on our submitted system, this paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection). In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, and deep embedding features. To address track 1, our system aggregated low-quality data augmentation, domain adaptation via fine-tuning, and fusion of complementary feature information. Furthermore, we analyzed the clustering characteristics of subsystems with different features using a visualization method and explained the effectiveness of our proposed greedy fusion strategy. For track 2, frame transitions and smoothing were detected with a self-supervised learning structure to capture the manipulations of PF attacks in the time domain. We ranked 4th and 5th in track 1 and track 2, respectively.


Key findings
The system achieved a 25.91% EER on the LF track, ranking 4th, and a 20.58% EER on the PF track, ranking 5th. Data augmentation and fine-tuning improved performance, and the greedy fusion strategy effectively combined complementary information from different models.
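
Both headline numbers are equal error rates (EER), the operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch of computing EER from detection scores with scikit-learn is shown below; the label convention and the placeholder scores are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """Equal error rate: where the false-positive and false-negative rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2)

# Placeholder detection scores; here 1 = bona fide, 0 = fake (convention assumed).
labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.7])
print(f"EER = {compute_eer(labels, scores):.2%}")
```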
Approach
The authors combine multiple audio front-ends (raw waveforms, spectrograms, MFCCs, etc.) with deep models (SE-Res2Net50, RawNet2, ResNet-TCN). For low-quality fake audio, they apply data augmentation and fine-tuning for domain adaptation, and a greedy fusion strategy combines the scores of the individual subsystems. For partially fake audio, self-supervised models extract temporal features that capture frame transitions and smoothing.
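
The greedy fusion strategy is described here only at a high level; one plausible reading is a forward-selection score fusion that repeatedly adds the subsystem whose inclusion most lowers the development-set EER and stops when no candidate improves it. The sketch below follows that reading; the equal-weight score averaging and the compute_eer helper are assumptions, not details confirmed by the paper.

```python
import numpy as np

def greedy_fusion(subsystem_scores: dict, labels: np.ndarray, compute_eer) -> list:
    """Greedily select subsystems whose averaged scores minimize dev-set EER."""
    selected, remaining = [], list(subsystem_scores)
    best_eer = float("inf")
    while remaining:
        # Evaluate each remaining subsystem when fused (score-averaged) with the current set.
        candidates = [
            (compute_eer(labels,
                         np.mean([subsystem_scores[n] for n in selected + [name]], axis=0)),
             name)
            for name in remaining
        ]
        eer, name = min(candidates)
        if eer >= best_eer:   # stop when no candidate improves the fused system
            break
        best_eer = eer
        selected.append(name)
        remaining.remove(name)
    return selected
```

Used with a dictionary mapping subsystem names (e.g., "SE-Res2Net50", "RawNet2") to their per-utterance development scores, the function returns the ordered subset chosen for fusion.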
Datasets
AISHELL-3 corpus; the ADD 2022 challenge dataset, including training, development, adaptation, and test sets for both Low-quality Fake Audio Detection (LF) and Partially Fake Audio Detection (PF) tracks.
Model(s)
SE-Res2Net50, RawNet2, ResNet-TCN, Bi-LSTM, Wav2Vec 2.0 Large, XLSR-53, WavLM Large
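
For the self-supervised front-ends listed above, a hedged sketch of extracting frame-level deep embeddings with the Hugging Face transformers library follows; the paper does not state which toolkit was used, and the checkpoint name, placeholder waveform, and downstream classifier are assumptions.

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Checkpoint name is an assumption; XLSR-53 is one of the listed front-ends.
checkpoint = "facebook/wav2vec2-large-xlsr-53"
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint).eval()

waveform = torch.zeros(16000)  # placeholder: 1 s of silence at 16 kHz
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frame_embeddings = model(**inputs).last_hidden_state  # shape (1, T, hidden_dim)

# A downstream countermeasure (e.g., a Bi-LSTM classifier) would consume these
# frame-level embeddings to score each utterance or frame as bona fide vs. fake.
```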
Author countries
China, Japan, Singapore