RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing

Authors: Yang Xiao, Ting Dang, Rohan Kumar Das

Published: 2025-07-11 00:24:47+00:00

AI Summary

RawTFNet is a lightweight CNN architecture for speech anti-spoofing that achieves performance comparable to state-of-the-art models while using fewer computing resources. It separates feature processing along the time and frequency dimensions to capture fine-grained details of synthetic speech, and is evaluated on the ASVspoof 2021 LA and DF datasets.

Abstract

Automatic speaker verification (ASV) systems are often affected by spoofing attacks. Recent transformer-based models have improved anti-spoofing performance by learning strong feature representations. However, these models usually need high computing power. To address this, we introduce RawTFNet, a lightweight CNN model designed for audio signals. The RawTFNet separates feature processing along time and frequency dimensions, which helps to capture the fine-grained details of synthetic speech. We tested RawTFNet on the ASVspoof 2021 LA and DF evaluation datasets. The results show that RawTFNet reaches comparable performance to that of the state-of-the-art models, while also using fewer computing resources. The code and models will be made publicly available.


Key findings
RawTFNet achieves performance comparable to state-of-the-art models on the ASVspoof 2021 LA and DF evaluation sets, with fewer parameters and lower computational cost. Ablation studies confirm that both the time and the frequency processing branches contribute to performance. Performance also depends on utterance duration and is best for utterances of roughly 2-4 seconds, the range used in training.
Approach
RawTFNet uses a CNN architecture with a novel Time-Frequency Convolution (TF-Conv) module. TF-Conv separates feature processing into time and frequency branches to capture subtle details in synthetic speech. The model also incorporates depthwise separable convolutions and squeeze-and-excitation blocks to improve efficiency.
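To illustrate why depthwise separable convolutions reduce model size, the sketch below compares the weight count of a standard 2-D convolution with that of a depthwise convolution followed by a 1x1 pointwise convolution. The function names, the 64-channel width, and the 3x3 kernel are illustrative assumptions and are not taken from the RawTFNet paper.

```python
# Minimal sketch: parameter counts for a standard vs. a depthwise separable
# 2-D convolution. The layer sizes used below are hypothetical examples,
# not RawTFNet's actual configuration.

def conv2d_params(c_in, c_out, k_h, k_w, bias=False):
    """Weights in a standard 2-D convolution (every output channel sees every input channel)."""
    return c_in * c_out * k_h * k_w + (c_out if bias else 0)


def depthwise_separable_params(c_in, c_out, k_h, k_w, bias=False):
    """Weights in a depthwise conv (one k_h x k_w filter per input channel)
    followed by a 1x1 pointwise conv that mixes channels."""
    depthwise = c_in * k_h * k_w + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


if __name__ == "__main__":
    standard = conv2d_params(64, 64, 3, 3)                 # 64 * 64 * 9
    separable = depthwise_separable_params(64, 64, 3, 3)   # 64 * 9 + 64 * 64
    print("standard conv weights: ", standard)
    print("separable conv weights:", separable)
    print("reduction factor: %.2fx" % (standard / separable))
```

For this hypothetical layer, the separable form needs 4,672 weights against 36,864 for the standard convolution, which is the kind of saving that makes a CNN of this style lighter than a comparably wide conventional one.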
Datasets
ASVspoof 2019 LA (training), ASVspoof 2021 LA (evaluation), ASVspoof 2021 DF (evaluation)
Model(s)
RawTFNet (a CNN architecture with TF-Conv modules, depthwise separable convolutions, and squeeze-and-excitation blocks)
Author countries
Australia, Singapore