Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

Authors: Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng

Published: 2024-06-25 08:50:43+00:00

AI Summary

This paper proposes a Temporal-Channel Modeling (TCM) module to improve synthetic speech detection by enhancing the multi-head self-attention mechanism in Transformer models. The TCM module captures temporal-channel dependencies in the input speech representation, yielding a 9.25% EER improvement on the ASVspoof 2021 DF track over the baseline with only 0.03M additional parameters.

Abstract

Recent synthetic speech detectors leveraging the Transformer model outperform their convolutional neural network counterparts. This improvement may stem from the powerful modeling ability of multi-head self-attention (MHSA) in the Transformer, which learns the temporal relationships among input tokens. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, and MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on ASVspoof 2021 show that with only 0.03M additional parameters, the TCM module outperforms the state-of-the-art system by 9.25% in EER. A further ablation study reveals that utilizing both temporal and channel information yields the most improvement for detecting synthetic speech.


Key findings
The TCM module achieves a 9.25% improvement in EER on the ASVspoof 2021 DF track compared to the baseline XLSR-Conformer, while adding only 0.03M parameters. Ablation studies confirm that both temporal and channel information contribute to effective synthetic speech detection, with their combination yielding the largest gain.
Approach
The authors propose a TCM module that integrates channel information (head tokens) with temporal information (input tokens) within the multi-head self-attention mechanism. This allows the model to better learn the interplay between temporal and spectral features crucial for distinguishing synthetic from real speech.
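To make the idea concrete, below is a minimal, hedged PyTorch sketch of self-attention augmented with learnable per-head "head tokens" that are concatenated with the temporal tokens, so attention can mix temporal and per-head (channel-group) information. Class names, shapes, and the token-prepending scheme are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch: MHSA with learnable per-head "head tokens" appended to the
# temporal token sequence, approximating the temporal-channel modeling idea.
# Names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn


class TemporalChannelAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # One learnable token per attention head, intended to summarize
        # channel-group (per-head) information (assumption).
        self.head_tokens = nn.Parameter(torch.randn(num_heads, dim) * 0.02)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame-level speech representation
        b, t, d = x.shape
        # Prepend the head tokens so they attend to, and are attended by,
        # the temporal tokens.
        tokens = torch.cat([self.head_tokens.expand(b, -1, -1), x], dim=1)
        n = tokens.size(1)
        qkv = self.qkv(tokens).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        out = self.proj(out)
        # Drop the head tokens; return the enriched temporal tokens.
        return out[:, self.num_heads:, :]


if __name__ == "__main__":
    # Dummy XLS-R-like feature sequence: (batch, frames, feature dim).
    feats = torch.randn(2, 200, 256)
    tcm_attn = TemporalChannelAttention(dim=256, num_heads=4)
    print(tcm_attn(feats).shape)  # torch.Size([2, 200, 256])
```

In this sketch the extra cost is just the head-token parameters and projection reuse, which is consistent with the paper's report of a very small (0.03M) parameter overhead; the exact interaction between head tokens and input tokens in the published TCM module may differ.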
Datasets
ASVspoof 2019 (LA track for training and development), ASVspoof 2021 (LA and DF tracks for evaluation)
Model(s)
XLSR-Conformer (with the proposed TCM module)
Author countries
Singapore, Hong Kong