HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods

Authors: Hyun-seo Shin, Jungwoo Heo, Ju-ho Kim, Chan-yeong Lim, Wonbin Kim, Ha-Jin Yu

Published: 2023-09-15 07:18:30+00:00

Comment: Submitted to 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)

AI Summary

This paper introduces HM-Conformer, a modified Conformer-based system for audio deepfake detection, motivated by the observation that directly applying the Conformer, which was designed for sequence-to-sequence tasks, to classification is sub-optimal. It integrates hierarchical pooling to reduce sequence length and eliminate duplicated information, alongside a multi-level classification token aggregation method that gathers features from different blocks. HM-Conformer efficiently detects spoofing evidence by processing and aggregating information from various sequence lengths, achieving a competitive 15.71% Equal Error Rate (EER) on the ASVspoof 2021 Deepfake dataset.

Abstract

Audio deepfake detection (ADD) is the task of detecting spoofing attacks generated by text-to-speech or voice conversion systems. Spoofing evidence, which helps to distinguish between spoofed and bona-fide utterances, might exist either locally or globally in the input features. To capture these, the Conformer, which consists of Transformers and CNN, possesses a suitable structure. However, since the Conformer was designed for sequence-to-sequence tasks, its direct application to ADD tasks may be sub-optimal. To tackle this limitation, we propose HM-Conformer by adopting two components: (1) Hierarchical pooling method progressively reducing the sequence length to eliminate duplicated information (2) Multi-level classification token aggregation method utilizing classification tokens to gather information from different blocks. Owing to these components, HM-Conformer can efficiently detect spoofing evidence by processing various sequence lengths and aggregating them. In experimental results on the ASVspoof 2021 Deepfake dataset, HM-Conformer achieved a 15.71% EER, showing competitive performance compared to recent systems.


Key findings
HM-Conformer achieved an EER of 15.71% on the ASVspoof 2021 Deepfake evaluation dataset, demonstrating a significant improvement over the Conformer baseline (18.91% EER) and competitive performance against other recent systems. Both hierarchical pooling and multi-level classification token aggregation methods were validated as effective in enhancing performance for the audio deepfake detection task.
Approach
The HM-Conformer modifies the Conformer architecture by incorporating a hierarchical pooling method that progressively reduces sequence length to condense features. It also employs a multi-level classification token aggregation (MCA) method, utilizing classification tokens from different Conformer blocks, each trained with auxiliary losses, to aggregate task-relevant information.
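The two components described above can be illustrated with a minimal sketch: the sequence is pooled between blocks to shorten it, and each level contributes a classification token that is concatenated into the final embedding. This is an illustrative simplification, not the authors' implementation: the Conformer blocks are replaced by identity placeholders, and a mean over frames stands in for a learned CLS token and the auxiliary losses.

```python
import numpy as np

def avg_pool_sequence(x, stride=2):
    """Hierarchical pooling step: halve the sequence length by
    averaging adjacent frames. x has shape (seq_len, dim); a trailing
    frame that does not fill a full group is dropped."""
    seq_len = (x.shape[0] // stride) * stride
    return x[:seq_len].reshape(-1, stride, x.shape[1]).mean(axis=1)

def hm_forward(features, num_blocks=3):
    """Sketch of hierarchical pooling + multi-level classification
    token aggregation. Each 'block' is an identity placeholder here;
    the real HM-Conformer uses Conformer blocks, and its CLS tokens
    are learned and trained with auxiliary losses."""
    cls_tokens = []
    x = features
    for _ in range(num_blocks):
        # Placeholder for a Conformer block (identity transform).
        # Emit a level-specific classification token; the frame mean
        # stands in for a learned CLS token.
        cls_tokens.append(x.mean(axis=0))
        # Hierarchical pooling: condense the sequence before the
        # next block to remove duplicated information.
        x = avg_pool_sequence(x)
    # Multi-level aggregation: concatenate tokens from all levels
    # into one fixed-size utterance embedding.
    return np.concatenate(cls_tokens)

# Usage: 16 frames of 4-dim features -> one (3 * 4)-dim embedding.
feats = np.random.randn(16, 4)
emb = hm_forward(feats)
print(emb.shape)  # (12,)
```

The key property this sketch preserves is that later blocks see shorter sequences (16, then 8, then 4 frames), while the final classifier input combines evidence captured at every level.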
Datasets
ASVspoof 2021 Deepfake dataset (evaluation), ASVspoof 2019 Logical Access (LA) dataset (training)
Model(s)
HM-Conformer (a modified Conformer architecture); Conformer (baseline, combining Transformer and CNN modules)
Author countries
South Korea