RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing

Authors: Yang Xiao, Ting Dang, Rohan Kumar Das

Published: 2025-07-11 00:24:47+00:00

Comment: Submitted to APSIPA ASC 2025

AI Summary

This paper introduces RawTFNet, a lightweight CNN model for speech anti-spoofing that addresses the high computational cost of existing transformer-based models. RawTFNet processes features separately along the time and frequency dimensions to capture fine-grained details of synthetic speech. On the ASVspoof 2021 LA and DF datasets, RawTFNet performs comparably to state-of-the-art models while using significantly fewer computational resources.

Abstract

Automatic speaker verification (ASV) systems are often affected by spoofing attacks. Recent transformer-based models have improved anti-spoofing performance by learning strong feature representations. However, these models usually need high computing power. To address this, we introduce RawTFNet, a lightweight CNN model designed for audio signals. RawTFNet separates feature processing along the time and frequency dimensions, which helps to capture the fine-grained details of synthetic speech. We tested RawTFNet on the ASVspoof 2021 LA and DF evaluation datasets. The results show that RawTFNet performs comparably to state-of-the-art models while using fewer computing resources. The code and models will be made publicly available.


Key findings
RawTFNet-16 (0.07M parameters, 2.9G MACs) and RawTFNet-32 (0.17M parameters, 5.4G MACs) achieve competitive anti-spoofing performance with significantly fewer computing resources than state-of-the-art models such as AASIST and SE-Rawformer. RawTFNet-16 reaches an EER of 4.50% and a min t-DCF of 0.295 on ASVspoof 2021 LA, while RawTFNet-32 achieves the lowest EER of 16.82% on ASVspoof 2021 DF. Ablation studies confirm that the separate time and frequency processing branches and the channel shuffle operation are both critical to the model's effectiveness.
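
For readers unfamiliar with the metric, the equal error rate (EER) quoted above is the operating point at which the false acceptance rate (spoofed speech accepted) equals the false rejection rate (bona fide speech rejected). The following is a minimal sketch of that idea, not the official ASVspoof scoring code. The function name `equal_error_rate`, the threshold sweep over observed scores, and the convention that a higher score means "more likely bona fide" are all assumptions made for illustration. The min t-DCF metric is not covered here.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumption: a higher score means 'more likely bona fide'.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: bona fide utterances scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed utterances scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up for illustration, not taken from the paper).
bonafide = [0.9, 0.8, 0.7, 0.35]
spoof = [0.1, 0.2, 0.3, 0.75]
print(f"EER = {equal_error_rate(bonafide, spoof):.2%}")
```

With only a handful of scores the sweep is coarse; evaluation toolkits typically interpolate between thresholds, so exact values may differ slightly from this sketch.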
Approach
The RawTFNet architecture is a lightweight CNN that processes raw audio signals. It uses a frontend with a Sinusoidal Convolution layer and DWS SE-Res2Net blocks to extract spectro-temporal features, followed by a novel Time-Frequency Convolutions (TF-Convs) module. TF-Convs separate and process feature maps along time and frequency dimensions using 1D depthwise convolutions to capture subtle artifacts.
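
To make the time/frequency separation concrete, here is a rough, dependency-free sketch of the general idea: one group of channels is filtered with 1D depthwise kernels along the time axis, another along the frequency axis, and the results are interleaved with a channel shuffle. The half-and-half channel split, the placement of the shuffle, the zero padding, and every function name below are assumptions made for illustration. The summary does not specify these details, so this should not be read as the authors' TF-Convs implementation.

```python
def depthwise_conv1d(seq, kernel):
    """Zero-padded 'same' 1D correlation of one sequence with one kernel.

    Assumes an odd kernel length so the output stays centered.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(seq))]


def conv_along_time(fmap, kernels):
    """fmap[c][f][t]: filter each frequency row of channel c with kernels[c]."""
    return [[depthwise_conv1d(row, kernels[c]) for row in channel]
            for c, channel in enumerate(fmap)]


def conv_along_freq(fmap, kernels):
    """fmap[c][f][t]: filter each time column of channel c with kernels[c]."""
    out = []
    for c, channel in enumerate(fmap):
        columns = [depthwise_conv1d(col, kernels[c]) for col in zip(*channel)]
        out.append([list(row) for row in zip(*columns)])  # back to [f][t]
    return out


def channel_shuffle(fmap, groups):
    """Interleave channels across groups (ShuffleNet-style reordering)."""
    per_group = len(fmap) // groups
    return [fmap[g * per_group + i]
            for i in range(per_group) for g in range(groups)]


def tf_conv_sketch(fmap, time_kernels, freq_kernels):
    """Hypothetical block: split channels, filter each half along one axis."""
    half = len(fmap) // 2
    time_branch = conv_along_time(fmap[:half], time_kernels)
    freq_branch = conv_along_freq(fmap[half:], freq_kernels)
    # Shuffle so later layers see time- and frequency-filtered channels mixed.
    return channel_shuffle(time_branch + freq_branch, groups=2)


# Toy feature map: 2 channels, 2 frequency bins, 3 time frames.
fmap = [[[1, 2, 3], [4, 5, 6]],
        [[1, 2, 3], [4, 5, 6]]]
out = tf_conv_sketch(fmap, time_kernels=[[1, 1, 1]], freq_kernels=[[1, 1, 1]])
print(out)
```

Because each channel has its own kernel and no cross-channel mixing occurs inside the convolutions, the cost grows linearly with the channel count, which is the usual motivation for depthwise designs in lightweight models.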
Datasets
ASVspoof 2021 LA, ASVspoof 2021 DF, ASVspoof 2019 LA
Model(s)
RawTFNet (CNN architecture), Sinusoidal Convolution layer, DWS SE-Res2Net blocks, Time-Frequency Convolutions (TF-Convs)
Author countries
Australia, Singapore