TO-Rawnet: Improving RawNet with TCN and Orthogonal Regularization for Fake Audio Detection

Authors: Chenglong Wang, Jiangyan Yi, Jianhua Tao, Chuyuan Zhang, Shuai Zhang, Ruibo Fu, Xun Chen

Published: 2023-05-23 05:30:17+00:00

AI Summary

This paper proposes TO-RawNet, a deep neural network architecture for fake audio detection. It improves on RawNet in two ways: orthogonal convolution reduces the correlation between Sinc-conv filters, and temporal convolutional networks (TCNs) capture long-term dependencies in speech signals. On the ASVspoof 2019 logical access scenario, the system achieves a 66.09% relative reduction in Equal Error Rate (EER) compared with RawNet.

Abstract

Current fake audio detection relies on hand-crafted features, which lose information during extraction. To overcome this, recent studies extract features directly from raw audio signals. For example, RawNet is one of the representative works in end-to-end fake audio detection. However, existing work on RawNet does not optimize the parameters of the Sinc-conv during training, which limits its performance. In this paper, we propose to incorporate orthogonal convolution into RawNet, which reduces the correlation between filters when optimizing the parameters of Sinc-conv, thus improving discriminability. Additionally, we introduce temporal convolutional networks (TCN) to capture long-term dependencies in speech signals. Experiments on ASVspoof 2019 show that our TO-RawNet system achieves a 66.09% relative reduction in EER on the logical access scenario compared with RawNet, demonstrating its effectiveness in detecting fake audio attacks.

Key findings

TO-RawNet achieved a 66.09% relative reduction in EER compared to RawNet on the ASVspoof 2019 LA dataset. Ablation studies confirmed the positive contributions of both orthogonal convolution and TCNs. Applying orthogonal regularization to the state-of-the-art AASIST model further improved its performance, highlighting the generalizability of this technique.
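
To make the headline metric concrete, the following is a minimal sketch of how an EER and a relative EER reduction could be computed. It is not the paper's evaluation code: the function names, the threshold-sweep approximation, and the convention that a higher score means "more likely bona fide" are all assumptions made for illustration, and the scores in the usage note are invented placeholders rather than results from the paper.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumed convention: a higher score means "more likely bona fide".
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a bona fide trial scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a spoofed trial scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, improved_eer):
    """Relative EER reduction in percent: 100 * (baseline - improved) / baseline."""
    return 100.0 * (baseline_eer - improved_eer) / baseline_eer
```

With the placeholder scores `compute_eer([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])`, one bona fide and one spoofed trial are misclassified at the balancing threshold, so both error rates are 1/4. Under this definition, a 66.09% relative reduction means the improved EER is about 0.34 times the baseline EER, whatever the baseline value is.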

Approach

TO-RawNet applies orthogonal convolution to RawNet's Sinc-conv layer, which reduces the correlation between the learned filters and makes them more discriminative. It also adds TCNs to capture long-range temporal dependencies in the audio signal.
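
The idea of penalizing correlated filters can be sketched as a soft orthogonality penalty on the flattened filter bank, the squared Frobenius norm of W·Wᵀ − I. This is one common kernel-level formulation and an assumption here; the paper's orthogonal convolution may be defined differently, and the function names below are hypothetical.

```python
def gram_matrix(filters):
    """Return G with G[i][j] = <w_i, w_j> for a list of flattened filters."""
    return [[sum(a * b for a, b in zip(wi, wj)) for wj in filters]
            for wi in filters]


def orthogonality_penalty(filters):
    """Squared Frobenius norm of (W W^T - I).

    The penalty is zero when the filters are orthonormal and grows as
    filters become correlated with each other or drift from unit norm.
    """
    gram = gram_matrix(filters)
    penalty = 0.0
    for i, row in enumerate(gram):
        for j, value in enumerate(row):
            target = 1.0 if i == j else 0.0
            penalty += (value - target) ** 2
    return penalty
```

For example, `orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]])` is zero, while two identical filters `[[1.0, 0.0], [1.0, 0.0]]` are penalized through the off-diagonal Gram entries. In training, a term like this would typically be added to the classification loss with a weighting coefficient, whose value is not given in this summary.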

Datasets

ASVspoof 2019 and ASVspoof 2021 Logical Access (LA) datasets

Model(s)

TO-RawNet (RawNet with orthogonal convolution and temporal convolutional networks), RawNet, Orth-AASIST (AASIST with orthogonal regularization)
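
The TCN component listed above relies on dilated causal convolutions, which is what lets a shallow stack cover a long time span. The sketch below shows the mechanism in plain Python under simplifying assumptions: one convolution per level, no residual connections, and hypothetical kernel sizes and dilations, since the paper's TCN configuration is not stated in this summary.

```python
def causal_dilated_conv1d(signal, kernel, dilation):
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Samples before t = 0 are treated as zero, so each output depends only
    on the current and past inputs (causality).
    """
    output = []
    for t in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += weight * signal[idx]
        output.append(acc)
    return output


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack with one dilated convolution per level."""
    return 1 + (kernel_size - 1) * sum(dilations)


def run_stack(signal, kernel, dilations):
    """Apply the same kernel at each level with the given dilations."""
    for dilation in dilations:
        signal = causal_dilated_conv1d(signal, kernel, dilation)
    return signal
```

Feeding a unit impulse through `run_stack(impulse, [1.0, 1.0], [1, 2, 4])` spreads it over exactly `receptive_field(2, [1, 2, 4])` samples, which shows how doubling the dilation at each level grows the temporal context exponentially with depth.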

Author countries

China
