Comprehensive Layer-wise Analysis of SSL Models for Audio Deepfake Detection

Authors: Yassine El Kheir, Youness Samih, Suraj Maharjan, Tim Polzehl, Sebastian Möller

Published: 2025-02-05 19:17:24+00:00

AI Summary

This paper analyzes the layer-wise contributions of self-supervised learning (SSL) models for audio deepfake detection across diverse languages and scenarios. It finds that the lower transformer layers consistently provide the most discriminative features, enabling computationally efficient detectors that match full-model performance while using only a subset of those lower layers.

Abstract

This paper conducts a comprehensive layer-wise analysis of self-supervised learning (SSL) models for audio deepfake detection across diverse contexts, including multilingual datasets (English, Chinese, Spanish), partial, song, and scene-based deepfake scenarios. By systematically evaluating the contributions of different transformer layers, we uncover critical insights into model behavior and performance. Our findings reveal that lower layers consistently provide the most discriminative features, while higher layers capture less relevant information. Notably, all models achieve competitive equal error rate (EER) scores even when employing a reduced number of layers. This indicates that we can reduce computational costs and increase the inference speed of detecting deepfakes by utilizing only a few lower layers. This work enhances our understanding of SSL models in deepfake detection, offering valuable insights applicable across varied linguistic and contextual settings. Our trained models and code are publicly available: https://github.com/Yaselley/SSL_Layerwise_Deepfake.


Key findings
Lower layers of SSL models consistently provide the most discriminative features for audio deepfake detection across various languages and scenarios. Using only a reduced number of lower layers achieves comparable or even better performance than using all layers, significantly reducing computational costs. The optimal number of layers varies slightly depending on the SSL model and dataset but generally ranges from 4-6 layers for small models and 10-12 for large models.
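The efficiency gain comes from stopping the forward pass after the first few transformer blocks. Below is a minimal sketch of that idea, assuming a Hugging Face transformers wav2vec 2.0 checkpoint; the checkpoint name, k = 4, and the dummy input are illustrative assumptions, not the authors' exact setup:

```python
# Minimal sketch: keep only the first k transformer layers of a pretrained
# SSL encoder to cut inference cost. Checkpoint and k are illustrative;
# the paper reports roughly 4-6 layers sufficing for small models.
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
k = 4  # number of lower layers to keep (assumed)

# encoder.layers is an nn.ModuleList; slicing it drops the upper blocks.
model.encoder.layers = model.encoder.layers[:k]
model.config.num_hidden_layers = k  # keep the config consistent

with torch.no_grad():
    wav = torch.randn(1, 16000)  # 1 second of 16 kHz audio (dummy input)
    feats = model(wav).last_hidden_state  # (1, frames, hidden) from layer k
print(feats.shape)
```

Since each transformer block costs roughly the same, keeping 4 of 12 layers reduces encoder compute approximately proportionally, which is the source of the reported inference speedup.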
Approach
The authors employ a layer-wise analysis framework using SSL models (Wav2Vec2, HuBERT, WavLM) as feature extractors and FFN or AASIST as classifiers. They systematically evaluate the contribution of each transformer layer by applying learnable weights to the layer outputs before classification, determining which layers are most important for deepfake detection.
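A minimal sketch of this learnable layer weighting, implemented as a softmax-normalized weighted sum over hidden states (SUPERB-style); the checkpoint name, pooling, and the FFN head dimensions are illustrative assumptions rather than the authors' exact configuration:

```python
# Sketch of learnable layer-wise weighting over SSL hidden states.
# Model name and FFN head are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class WeightedLayerClassifier(nn.Module):
    def __init__(self, ssl_name="facebook/wav2vec2-base", n_classes=2):
        super().__init__()
        self.ssl = Wav2Vec2Model.from_pretrained(ssl_name)
        n_layers = self.ssl.config.num_hidden_layers + 1  # +1: CNN feature output
        # One learnable scalar weight per layer, normalized with softmax.
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        self.head = nn.Sequential(  # simple FFN classifier (assumed)
            nn.Linear(self.ssl.config.hidden_size, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, wav):  # wav: (batch, samples), 16 kHz mono
        out = self.ssl(wav, output_hidden_states=True)
        h = torch.stack(out.hidden_states, dim=0)   # (layers, batch, frames, hidden)
        w = torch.softmax(self.layer_weights, dim=0)
        fused = (w.view(-1, 1, 1, 1) * h).sum(dim=0)  # weighted sum over layers
        pooled = fused.mean(dim=1)                    # average over time frames
        return self.head(pooled)
```

After training, the learned weights w indicate each layer's relative importance, which is how the analysis identifies the lower layers as most discriminative.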
Datasets
ASVspoof 2019 (LA19), ASVspoof 2021 (LA21, DF21), ADD23 (Track 1.2), HABLA, PartialSpoof, Half-Truth (HAD), CtrSVDD, SceneFake
Model(s)
Wav2Vec2 (small and large), HuBERT (small and large), WavLM (small and large), FFN, AASIST
Author countries
Germany, UAE