Two Views, One Truth: Spectral and Self-Supervised Features Fusion for Robust Speech Deepfake Detection

Authors: Yassine El Kheir, Arnab Das, Enes Erdem Erdogan, Fabian Ritter-Guttierez, Tim Polzehl, Sebastian Möller

Published: 2025-07-27 21:22:27+00:00

AI Summary

This paper proposes a robust audio deepfake detection method that fuses self-supervised learning (SSL) features with handcrafted spectral features (MFCC, LFCC, CQCC). Fusing the two views with a cross-attention mechanism markedly improves generalization over an SSL-only baseline, yielding a 38% relative reduction in equal error rate (EER).

Abstract

Recent advances in synthetic speech have made audio deepfakes increasingly realistic, posing significant security risks. Existing detection methods that rely on a single modality, either raw waveform embeddings or spectral-based features, are vulnerable to non-spoof disturbances and often overfit to known forgery algorithms, resulting in poor generalization to unseen attacks. To address these shortcomings, we investigate hybrid fusion frameworks that integrate self-supervised learning (SSL) based representations with handcrafted spectral descriptors (MFCC, LFCC, CQCC). By aligning and combining complementary information across modalities, these fusion approaches capture subtle artifacts that single-feature approaches typically overlook. We explore several fusion strategies, including simple concatenation, cross-attention, mutual cross-attention, and a learnable gating mechanism, to optimally blend SSL features with fine-grained spectral cues. We evaluate our approach on four challenging public benchmarks and report generalization performance. All fusion variants consistently outperform an SSL-only baseline, with the cross-attention strategy achieving the best generalization with a 38% relative reduction in equal error rate (EER). These results confirm that joint modeling of waveform and spectral views produces robust, domain-agnostic representations for audio deepfake detection.
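For concreteness, the snippet below shows how two of the handcrafted spectral front-ends named in the abstract (MFCC and LFCC) can be computed with torchaudio. The sample rate, frame settings, and coefficient counts are assumptions, as the exact front-end configuration is not given in this summary, and CQCC has no built-in torchaudio transform.

```python
import torch
import torchaudio

# Assumed settings (not specified in this summary): 16 kHz audio,
# 20 coefficients per frame for both MFCC and LFCC.
SAMPLE_RATE = 16_000
N_COEFFS = 20

mfcc = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE,
    n_mfcc=N_COEFFS,
    melkwargs={"n_fft": 512, "hop_length": 160, "n_mels": 64},
)
lfcc = torchaudio.transforms.LFCC(
    sample_rate=SAMPLE_RATE,
    n_lfcc=N_COEFFS,
    speckwargs={"n_fft": 512, "hop_length": 160},
)

waveform, sr = torchaudio.load("utterance.wav")                    # (channels, samples)
waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)

mfcc_feats = mfcc(waveform)   # (channels, n_mfcc, frames)
lfcc_feats = lfcc(waveform)   # (channels, n_lfcc, frames)
# CQCC is typically computed separately from a constant-Q transform
# followed by cepstral analysis; torchaudio provides no direct transform.
```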


Key findings
All fusion methods outperform an SSL-only baseline. The cross-attention strategy with CQCC features achieved the best results, significantly reducing the EER across all datasets. Analysis of learnable gating weights shows that while SSL features dominate, spectral features still contribute significantly (~20%) to the detection process.
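The ~80/20 split implied by the gating weights can be read directly from a gate value in a formulation like the one sketched below. This is a minimal, assumed form of the learnable gating fusion (the projection layers, dimensions, and per-dimension gating are not specified in this summary); the reported finding concerns only the averaged gate contributions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """One plausible form of the learnable gating fusion: a sigmoid gate,
    predicted from both views, blends the SSL and spectral embeddings.
    Dimensions and gate granularity are assumptions for illustration."""

    def __init__(self, ssl_dim: int = 1024, spec_dim: int = 60, fused_dim: int = 256):
        super().__init__()
        self.proj_ssl = nn.Linear(ssl_dim, fused_dim)
        self.proj_spec = nn.Linear(spec_dim, fused_dim)
        self.gate = nn.Sequential(nn.Linear(2 * fused_dim, fused_dim), nn.Sigmoid())

    def forward(self, ssl_feats: torch.Tensor, spec_feats: torch.Tensor) -> torch.Tensor:
        # ssl_feats: (batch, frames, ssl_dim); spec_feats: (batch, frames, spec_dim),
        # assumed to be time-aligned to the same number of frames beforehand.
        h_ssl = self.proj_ssl(ssl_feats)
        h_spec = self.proj_spec(spec_feats)
        g = self.gate(torch.cat([h_ssl, h_spec], dim=-1))   # values in (0, 1)
        # A mean gate value around 0.8 would correspond to the reported ~80/20
        # split between SSL and spectral contributions.
        return g * h_ssl + (1.0 - g) * h_spec
```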
Approach
The approach uses two parallel streams to extract features: one using a pre-trained Wav2Vec2.0 XLSR-53 model for SSL features, and another extracting handcrafted spectral features. These features are then fused using one of several strategies (concatenation, cross-attention, mutual cross-attention, or learnable gating), and the fused representation is classified by an AASIST back-end.
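A minimal PyTorch sketch of the cross-attention fusion stage is given below. It assumes the SSL frames act as queries attending over the spectral frames and uses a standard multi-head attention layer with a placeholder classification head in place of the AASIST back-end; the actual query/key assignment, dimensions, and back-end wiring in the paper may differ.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of cross-attention fusion between SSL and spectral streams.
    SSL frames query the spectral frames; the fused sequence is pooled and
    scored by a placeholder head standing in for the AASIST classifier."""

    def __init__(self, ssl_dim: int = 1024, spec_dim: int = 60,
                 d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.proj_ssl = nn.Linear(ssl_dim, d_model)
        self.proj_spec = nn.Linear(spec_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, 2)   # bonafide vs. spoof logits

    def forward(self, ssl_feats: torch.Tensor, spec_feats: torch.Tensor) -> torch.Tensor:
        # ssl_feats: (batch, T_ssl, ssl_dim); spec_feats: (batch, T_spec, spec_dim)
        q = self.proj_ssl(ssl_feats)
        kv = self.proj_spec(spec_feats)
        attended, _ = self.cross_attn(q, kv, kv)   # SSL frames attend to spectral frames
        fused = self.norm(q + attended)            # residual connection (assumed)
        return self.head(fused.mean(dim=1))        # (batch, 2)
```

Mutual cross-attention, mentioned in the abstract, would be the symmetric variant in which each stream also attends to the other and the two attended sequences are combined before classification.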
Datasets
ASVspoof LA19, ASVspoof DF21, In-The-Wild (ITW), ASVspoof 5
Model(s)
Wav2Vec2.0 XLSR-53, AASIST (with various fusion strategies: concatenation, cross-attention, mutual cross-attention, learnable gating)
Author countries
Germany, Singapore