Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing

Authors: Tianchi Liu, Duc-Tuan Truong, Rohan Kumar Das, Kong Aik Lee, Haizhou Li

Published: 2025-04-08 04:11:28+00:00

AI Summary

This paper introduces Nes2Net, a lightweight back-end architecture for speech anti-spoofing that directly processes high-dimensional features from speech foundation models without dimensionality reduction layers. On the CtrSVDD singing voice deepfake detection dataset, it reports a 22% performance improvement and an 87% reduction in back-end computational cost over the state-of-the-art baseline.

Abstract

Speech foundation models have significantly advanced various speech-related tasks by providing exceptional representation capabilities. However, their high-dimensional output features often create a mismatch with downstream task models, which typically require lower-dimensional inputs. A common solution is to apply a dimensionality reduction (DR) layer, but this approach increases parameter overhead, computational costs, and risks losing valuable information. To address these issues, we propose Nested Res2Net (Nes2Net), a lightweight back-end architecture designed to directly process high-dimensional features without DR layers. The nested structure enhances multi-scale feature extraction, improves feature interaction, and preserves high-dimensional information. We first validate Nes2Net on CtrSVDD, a singing voice deepfake detection dataset, and report a 22% performance improvement and an 87% back-end computational cost reduction over the state-of-the-art baseline. Additionally, extensive testing across four diverse datasets: ASVspoof 2021, ASVspoof 5, PartialSpoof, and In-the-Wild, covering fully spoofed speech, adversarial attacks, partial spoofing, and real-world scenarios, consistently highlights Nes2Net's superior robustness and generalization capabilities. The code package and pre-trained models are available at https://github.com/Liu-Tianchi/Nes2Net.
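
To make the overhead of a DR layer concrete, the sketch below counts the parameters and per-frame multiply-accumulates of a single linear projection. The sizes are assumptions chosen for illustration, not values from the paper: 1024 is the hidden size of the large WavLM and wav2vec 2.0 models, and 128 is a hypothetical back-end input width. The helper name `linear_layer_cost` is likewise invented here.

```python
def linear_layer_cost(in_dim: int, out_dim: int, bias: bool = True) -> dict:
    """Parameter count and per-frame multiply-accumulates (MACs) of a dense projection."""
    weights = in_dim * out_dim                    # one weight per input/output pair
    params = weights + (out_dim if bias else 0)   # optional bias term per output
    return {"params": params, "macs_per_frame": weights}


# Assumed sizes (illustrative only, not taken from the paper).
FRONT_END_DIM = 1024   # hidden size of large WavLM / wav2vec 2.0 models
BACK_END_DIM = 128     # hypothetical lower-dimensional back-end input

cost = linear_layer_cost(FRONT_END_DIM, BACK_END_DIM)
print(cost)
```

Under these assumed sizes the projection alone adds about 131k parameters, and the same number of multiply-accumulates is spent on every frame before the back-end does any work. A back-end that accepts the front-end features directly avoids this layer.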


Key findings
Nes2Net consistently outperforms state-of-the-art baselines across multiple datasets and spoofing scenarios. Nes2Net-X achieves the best reported performance on ASVspoof 2021 DF and In-the-Wild datasets with significantly reduced computational costs compared to other top-performing models. The model shows robustness to various attack types and compression conditions.
Approach
Nes2Net uses a nested Res2Net structure to directly process high-dimensional features from speech foundation models, eliminating the need for dimensionality reduction layers. This nested structure enhances multi-scale feature extraction and improves feature interaction, preserving high-dimensional information and improving efficiency.
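The nesting idea can be sketched with the hierarchical split used in Res2Net blocks: channels are divided into groups, the first group passes through unchanged, and each later group is transformed after adding the previous group's output. The code below is a minimal, dependency-free illustration and not the authors' implementation. The function names, the stand-in `transform` (a convolution in a real model), and the choice to nest by using an inner split as the outer transform are all assumptions. Residual connections and other layers are omitted.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Transform = Callable[[Vector], Vector]


def split_groups(x: Sequence[float], n_groups: int) -> List[Vector]:
    """Split a channel vector into equally sized contiguous groups."""
    if len(x) % n_groups != 0:
        raise ValueError("channel count must be divisible by n_groups")
    width = len(x) // n_groups
    return [list(x[i * width:(i + 1) * width]) for i in range(n_groups)]


def res2net_style(x: Sequence[float], scale: int, transform: Transform) -> Vector:
    """Hierarchical split: y1 = x1, y2 = K(x2), yi = K(xi + y(i-1)) for i > 2."""
    groups = split_groups(x, scale)
    outputs = [groups[0]]            # first group is passed through unchanged
    prev = None
    for group in groups[1:]:
        # Later groups also receive the previous group's output.
        inp = group if prev is None else [a + b for a, b in zip(group, prev)]
        prev = transform(inp)
        outputs.append(prev)
    return [v for grp in outputs for v in grp]   # concatenate the groups


def nested_res2net_style(x: Sequence[float], outer_scale: int,
                         inner_scale: int, transform: Transform) -> Vector:
    """Assumed nesting: the outer split uses an inner split as its transform."""
    def inner(v: Vector) -> Vector:
        return res2net_style(v, inner_scale, transform)
    return res2net_style(x, outer_scale, inner)


def double(v: Vector) -> Vector:
    """Stand-in for a learned transform such as a convolution."""
    return [2 * a for a in v]


flat = res2net_style([1, 2, 3, 4], scale=4, transform=double)
nested = nested_res2net_style([1, 2, 3, 4, 5, 6, 7, 8],
                              outer_scale=2, inner_scale=2, transform=double)
print(flat)
print(nested)
```

The output keeps the input's channel count, so no projection to a lower dimension is needed before or after the block.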
Datasets
CtrSVDD, ASVspoof 2019, ASVspoof 2021 (LA and DF), ASVspoof 5, PartialSpoof, In-the-Wild
Model(s)
Nes2Net (and its enhanced variant Nes2Net-X), ResNet, Res2Net, ECAPA-TDNN, AASIST. WavLM and wav2vec 2.0 are used as front-end models.
Author countries
Singapore, Singapore, Singapore, Hong Kong, China