The Affective Bridge: Unifying Feature Representations for Speech Deepfake Detection

Authors: Yupei Li, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang, Björn W. Schuller

Published: 2025-12-12 02:49:18+00:00

AI Summary

The paper introduces "EmoBridge," a novel training framework for speech deepfake detection that unifies diverse feature representations by leveraging emotion as a bridging mechanism. This approach integrates emotion-related characteristics into existing feature encoders through a continual learning strategy, aiming for a robust and interpretable feature space. EmoBridge consistently improves deepfake detection performance across various datasets and feature types.

Abstract

Speech deepfake detection has been widely explored using low-level acoustic descriptors. However, each study tends to select different feature sets, making it difficult to establish a unified representation for the task. Moreover, such features are not intuitive for humans to perceive, as the distinction between bona fide and synthesized speech becomes increasingly subtle with the advancement of deepfake generation techniques. Emotion, on the other hand, remains a unique human attribute that current deepfake generators struggle to fully replicate, reflecting the gap toward true artificial general intelligence. Interestingly, many existing acoustic and semantic features have implicit correlations with emotion. For instance, speech features recognized by automatic speech recognition systems often vary naturally with emotional expression. Based on this insight, we propose a novel training framework that leverages emotion as a bridge between conventional deepfake features and emotion-oriented representations. Experiments on the widely used FakeOrReal and In-the-Wild datasets demonstrate consistent and substantial improvements: accuracy gains of up to approximately 6% and 2%, respectively, and equal error rate (EER) reductions of up to about 4% and 1%, respectively, while results on ASVspoof2019 remain comparable. This approach provides a unified training strategy for all features and an interpretable feature direction for deepfake detection, while improving model performance through emotion-informed learning.


Key findings
The EmoBridge strategy consistently and substantially improved deepfake detection performance on the FakeOrReal and In-the-Wild datasets, with accuracy increases of up to 6% and 2% and EER reductions of up to 4% and 1%, respectively, while maintaining comparable performance on ASVspoof2019. The benefits were most pronounced for deep-learned and human-speech-based deepfake features, suggesting that emotion-related information is a distinctive cue that current deepfake audio models struggle to reproduce accurately.
Approach
The EmoBridge framework aligns affective cues through a continual learning process: the encoder of a pre-trained model (e.g., for automatic speech recognition, speaker verification, or raw deep-learned features) is further trained on an emotion recognition task. This fuses emotion-related features into the encoder's representations, which are then used as inputs to a Support Vector Machine (SVM) classifier for the final deepfake detection task, as sketched below.
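The sketch below illustrates one way such a pipeline could be wired up: a pre-trained WavLM encoder (one of the models listed below) is fine-tuned with an added emotion-classification head, and its pooled embeddings are then fed to an SVM for deepfake detection. The model name, mean pooling, linear head, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of an EmoBridge-style pipeline, assuming a WavLM encoder from
# Hugging Face Transformers. All architectural choices here are illustrative.
import torch
import torch.nn as nn
from transformers import WavLMModel
from sklearn.svm import SVC


class EmotionBridgeEncoder(nn.Module):
    """Pre-trained speech encoder plus an emotion head used to fuse
    affective cues into the encoder's representations."""

    def __init__(self, num_emotions: int = 7, model_name: str = "microsoft/wavlm-base"):
        super().__init__()
        self.encoder = WavLMModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.emotion_head = nn.Linear(hidden, num_emotions)

    def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
        # Mean-pool frame-level hidden states into one utterance embedding.
        hidden_states = self.encoder(waveforms).last_hidden_state
        return hidden_states.mean(dim=1)

    def emotion_logits(self, waveforms: torch.Tensor) -> torch.Tensor:
        return self.emotion_head(self.forward(waveforms))


def bridge_train_step(model, optimizer, waveforms, emotion_labels):
    """One continual-learning step: fine-tune the encoder on emotion recognition."""
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model.emotion_logits(waveforms), emotion_labels)
    loss.backward()
    optimizer.step()
    return loss.item()


def fit_deepfake_svm(model, waveforms, spoof_labels):
    """Use the emotion-informed embeddings as inputs to an SVM deepfake classifier."""
    model.eval()
    with torch.no_grad():
        embeddings = model(waveforms).cpu().numpy()
    clf = SVC(kernel="rbf")  # SVM back end, as in the paper; kernel choice is assumed
    clf.fit(embeddings, spoof_labels)
    return clf
```

In this reading, the emotion-recognition fine-tuning and the SVM training are two separate stages: the encoder is first adapted on emotion-labelled corpora, then frozen while its embeddings serve as features for deepfake detection.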
Datasets
FakeOrReal (FoR), In-the-Wild (ITW), ASVspoof2019 LA, Toronto Emotional Speech Set (TESS), Surrey Audio-Visual Expressed Emotion (SAVEE), CREMA-D, RAVDESS, Emotion Speech Dataset (ESD), IEMOCAP
Model(s)
openSMILE, Whisper, SpeechT5, WavLM, HuBERT, Support Vector Machine (SVM), three-layer fully connected network
Author countries
UK, Germany, China