Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition

Authors: Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Sishir Kalita, Arun Balaji Buduru, Rajesh Sharma, S. R Mahadeva Prasanna

Published: 2024-09-21 18:58:10+00:00

AI Summary

This research explores multimodal foundation models (MFMs) for non-verbal emotion recognition (NVER), hypothesizing that joint pre-training across modalities helps disambiguate subtle emotional cues that audio-only models miss. A novel optimal-transport-based fusion framework, MATA, is proposed to combine MFM representations, achieving state-of-the-art results on benchmark datasets.

Abstract

In this study, we investigate multimodal foundation models (MFMs) for emotion recognition from non-verbal sounds. We hypothesize that MFMs, with their joint pre-training across multiple modalities, will be more effective for non-verbal emotion recognition (NVER) by better interpreting and differentiating subtle emotional cues that may be ambiguous to audio-only foundation models (AFMs). To validate our hypothesis, we extract representations from state-of-the-art (SOTA) MFMs and AFMs and evaluate them on benchmark NVER datasets. We also investigate the potential of combining selected foundation model representations to further enhance NVER, inspired by research in speech recognition and audio deepfake detection. To achieve this, we propose a framework called MATA (Intra-Modality Alignment through Transport Attention). With MATA coupled with the combination of the MFMs LanguageBind and ImageBind, we report the topmost performance, with accuracies of 76.47%, 77.40%, 75.12% and F1-scores of 70.35%, 76.19%, 74.63% on the ASVP-ESD, JNV, and VIVAE datasets respectively, outperforming individual FMs and baseline fusion techniques and setting SOTA on the benchmark datasets.
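The abstract's first step is extracting frozen representations from each foundation model. Below is a minimal sketch of what that looks like for one of the evaluated AFMs (WavLM) via the Hugging Face transformers library; the checkpoint name, dummy waveform, and mean-pooling choice are illustrative assumptions, not the authors' exact pipeline.

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

# Load a frozen AFM; "microsoft/wavlm-base-plus" is an assumed checkpoint.
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()

# Stand-in for a 1-second, 16 kHz non-verbal vocalization clip.
waveform = torch.randn(16000).numpy()
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, 768)

# Mean-pool frames into a clip-level embedding (an assumed pooling choice).
clip_embedding = hidden.mean(dim=1)  # (1, 768)
```

The same pattern applies to the MFMs (LanguageBind, ImageBind): feed the audio through each model's audio encoder and keep the resulting embeddings for downstream evaluation.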


Key findings
Multimodal foundation models significantly outperform audio-only models in NVER. The proposed MATA framework further improves performance by effectively fusing representations from multiple models, achieving state-of-the-art accuracy and F1-scores on benchmark datasets. The results demonstrate the complementary nature of MFMs in NVER.
Approach
The study compares audio-only foundation models (AFMs) with MFMs for NVER. A novel fusion framework, MATA, uses optimal transport to align and integrate representations from different foundation models, enhancing performance. A downstream CNN is used for classification.
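To make the optimal-transport idea concrete, here is a minimal sketch of transport-based alignment between two models' frame-level representations, followed by a small CNN head. It uses an entropy-regularized Sinkhorn iteration over a cosine-cost matrix; all function names, dimensions, the shared-projection assumption, and the classifier architecture are illustrative, not the paper's exact MATA formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinkhorn(cost: torch.Tensor, eps: float = 0.05, iters: int = 50) -> torch.Tensor:
    """Entropy-regularized OT plan for a cost matrix (n, m) with uniform marginals."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)            # Gibbs kernel
    a = torch.full((n,), 1.0 / n)         # source marginal
    b = torch.full((m,), 1.0 / m)         # target marginal
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(iters):                # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]    # transport plan (n, m)

def transport_attention_fuse(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Align sequence y (m, d) to x (n, d) via the OT plan, then concatenate.

    x, y: frame-level representations from two foundation models, assumed
    already projected to a shared dimension d.
    """
    cost = 1.0 - F.normalize(x, dim=-1) @ F.normalize(y, dim=-1).T  # cosine cost
    plan = sinkhorn(cost)
    y_aligned = (plan @ y) * x.shape[0]   # barycentric mapping of y onto x's frames
    return torch.cat([x, y_aligned], dim=-1)  # fused (n, 2d) features

# Example: fuse 50 LanguageBind frames with 60 ImageBind frames, both d = 256.
fused = transport_attention_fuse(torch.randn(50, 256), torch.randn(60, 256))

# Hypothetical downstream classifier: a small 1-D CNN over the fused frames,
# standing in for the paper's unspecified CNN head.
classifier = nn.Sequential(
    nn.Conv1d(512, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(128, 7),                    # e.g. 7 emotion classes (dataset-dependent)
)
logits = classifier(fused.T.unsqueeze(0))  # (1, 512, 50) -> (1, 7)
```

Scaling the transported features by `n` turns the plan's rows (which sum to 1/n under uniform marginals) into proper barycentric weights, so each of x's frames receives a convex combination of y's frames.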
Datasets
ASVP-ESD, JNV, VIVAE, CREMA-D
Model(s)
LanguageBind, ImageBind, WavLM, UniSpeech-SAT, wav2vec 2.0
Author countries
India, Estonia