Understanding the strengths and weaknesses of SSL models for audio deepfake model attribution

Authors: Gabriel Pîrlogeanu, Adriana Stan, Horia Cucu

Published: 2026-03-13 18:04:33+00:00

Comment: Accepted for publication at ICASSP 2026

AI Summary

This paper systematically investigates how self-supervised learning (SSL)-derived features capture architectural signatures in audio deepfakes for model attribution. By controlling multiple dimensions of the audio generation process, the authors reveal how subtle perturbations in model checkpoints, text prompts, vocoders, or speaker identity influence attribution. The study provides new insights into the robustness, biases, and limitations of SSL-based deepfake attribution, highlighting both its strengths and vulnerabilities.

Abstract

Audio deepfake model attribution aims to mitigate the misuse of synthetic speech by identifying the source model responsible for generating a given audio sample, enabling accountability and informing vendors. Although the task is challenging, self-supervised learning (SSL)-derived acoustic features have demonstrated state-of-the-art attribution capabilities; yet the underlying factors driving their success and the limits of their discriminative power remain unclear. In this paper, we systematically investigate how SSL-derived features capture architectural signatures in audio deepfakes. By controlling multiple dimensions of the audio generation process, we reveal how subtle perturbations in model checkpoints, text prompts, vocoders, or speaker identity influence attribution. Our results provide new insights into the robustness, biases, and limitations of SSL-based deepfake attribution, highlighting both its strengths and vulnerabilities in realistic scenarios.


Key findings
- SSL models achieve high accuracy for architecture-level attribution but show weaker performance for fine-grained checkpoint attribution, especially for models that converge quickly.
- Attribution performance is significantly impacted by changes in vocoders and speaker identity in out-of-domain scenarios, while linguistic content primarily influences checkpoint discrimination.
- w2v-bert-2.0 generally demonstrates superior out-of-domain robustness and is less affected by speaker identity changes than wav2vec2-xls-r-2b.
Approach
The authors employ a lightweight attribution system using k-Nearest Neighbours (kNN) applied to time-domain average pooled features extracted from two self-supervised learning (SSL) models. They systematically investigate the impact of various perturbation factors, such as model checkpoints, text prompts, vocoders, and speaker identity, on the ability of these SSL features to attribute the source of synthetic audio.
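The attribution pipeline described above can be sketched in a few lines: frame-level SSL features are average-pooled over time into one embedding per utterance, and a kNN vote over labeled reference embeddings assigns the source model. The sketch below uses random arrays as stand-ins for real SSL outputs (e.g. from wav2vec2-xls-r-2b or w2v-bert-2.0) and plain Euclidean-distance kNN; the model names `model_A`/`model_B`, the feature shapes, and the choice of distance are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def mean_pool(frames: np.ndarray) -> np.ndarray:
    """Average-pool frame-level SSL features of shape (T, D) over time to a single (D,) vector."""
    return frames.mean(axis=0)

def knn_attribute(query: np.ndarray, refs: np.ndarray, labels: list, k: int = 5) -> str:
    """Attribute a query embedding by majority vote among its k nearest reference embeddings."""
    dists = np.linalg.norm(refs - query, axis=1)   # Euclidean distance to every reference
    nearest = np.argsort(dists)[:k]                # indices of the k closest references
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)        # majority label among the neighbours

# Toy demo: random frame-level "features" stand in for real SSL model outputs.
# Each hypothetical generator ("model_A", "model_B") leaves a different mean offset.
rng = np.random.default_rng(0)
ref_embs, ref_labels = [], []
for model_id, offset in [("model_A", 0.0), ("model_B", 2.0)]:
    for _ in range(10):
        frames = rng.normal(loc=offset, size=(50, 16))  # (T=50 frames, D=16 dims)
        ref_embs.append(mean_pool(frames))
        ref_labels.append(model_id)
ref_embs = np.stack(ref_embs)

# A query utterance drawn from the second generator's distribution.
query = mean_pool(rng.normal(loc=2.0, size=(50, 16)))
print(knn_attribute(query, ref_embs, ref_labels))  # prints "model_B"
```

In practice the pooled embeddings would come from a frozen SSL encoder's hidden states, and the reference set would hold labeled utterances from each candidate generation system; the kNN classifier itself needs no training, which is what makes the attribution system lightweight.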
Datasets
LJSpeech, HiFi-TTS
Model(s)
wav2vec2-xls-r-2b, w2v-bert-2.0, k-Nearest Neighbours (kNN)
Author countries
Romania