Multimodal Zero-Shot Framework for Deepfake Hate Speech Detection in Low-Resource Languages

Authors: Rishabh Ranjan, Likhith Ayinala, Mayank Vatsa, Richa Singh

Published: 2025-06-10 02:37:42+00:00

AI Summary

This paper proposes a multimodal framework for zero-shot hate speech detection in deepfake audio, with a particular focus on low-resource languages. It leverages contrastive learning to jointly align audio and text representations across languages and introduces a new benchmark dataset with paired text and synthesized speech samples in six languages.

Abstract

This paper introduces a novel multimodal framework for hate speech detection in deepfake audio, excelling even in zero-shot scenarios. Unlike previous approaches, our method uses contrastive learning to jointly align audio and text representations across languages. We present the first benchmark dataset with 127,290 paired text and synthesized speech samples in six languages: English and five low-resource Indian languages (Hindi, Bengali, Marathi, Tamil, Telugu). Our model learns a shared semantic embedding space, enabling robust cross-lingual and cross-modal classification. Experiments on two multilingual test sets show our approach outperforms baselines, achieving accuracies of 0.819 and 0.701, and generalizes well to unseen languages. This demonstrates the advantage of combining modalities for hate speech detection in synthetic media, especially in low-resource settings where unimodal models falter. The dataset is available at https://www.iab-rubric.org/resources.


Key findings
The proposed multimodal model outperforms unimodal baselines in both same-language and cross-language (zero-shot) scenarios, achieving accuracies of 0.819 and 0.701 on two multilingual test sets. The results highlight the benefits of multimodal approaches for hate speech detection, especially in low-resource settings.
Approach
The framework uses contrastive learning to align audio and text representations from state-of-the-art encoders (SONAR for text, SeamlessM4T for audio) into a shared embedding space. A classifier is then trained on the shared-space embeddings to detect hate speech, enabling both cross-lingual and cross-modal classification.
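A minimal PyTorch sketch of this alignment step follows. The projection heads, embedding dimensions, batch shapes, and triplet margin below are illustrative assumptions rather than values from the paper; SONAR and SeamlessM4T are treated here only as frozen feature extractors whose outputs arrive as tensors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceAligner(nn.Module):
    """Projects frozen SONAR text embeddings and SeamlessM4T audio
    embeddings into one shared semantic space (sizes are assumptions)."""
    def __init__(self, text_dim=1024, audio_dim=1024, shared_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.audio_proj = nn.Linear(audio_dim, shared_dim)

    def forward(self, text_emb, audio_emb):
        # L2-normalize so distances in the shared space are comparable
        t = F.normalize(self.text_proj(text_emb), dim=-1)
        a = F.normalize(self.audio_proj(audio_emb), dim=-1)
        return t, a

# Contrastive (triplet) objective: pull a text embedding toward the
# audio of the same utterance, push it away from mismatched audio.
triplet = nn.TripletMarginLoss(margin=0.2)  # margin is an assumption

aligner = SharedSpaceAligner()
text_emb  = torch.randn(8, 1024)  # stand-in for SONAR outputs
audio_pos = torch.randn(8, 1024)  # paired SeamlessM4T outputs
audio_neg = torch.randn(8, 1024)  # shuffled / mismatched audio

t, a_pos = aligner(text_emb, audio_pos)
_, a_neg = aligner(text_emb, audio_neg)
loss = triplet(t, a_pos, a_neg)
loss.backward()
```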
Datasets
A new benchmark dataset with 127,290 paired text and synthesized speech samples in English and five low-resource Indian languages (Hindi, Bengali, Marathi, Tamil, Telugu). The dataset was created by converting existing text corpora to audio with Meta's massively multilingual speech model.
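As a hedged sketch of how such text-to-speech conversion can be done, the snippet below uses Meta's MMS-TTS checkpoints via Hugging Face Transformers. The specific checkpoint (`facebook/mms-tts-eng`) and this pipeline are assumptions for illustration; the paper's exact synthesis setup is not detailed in this summary.

```python
import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

# English MMS-TTS checkpoint; per-language checkpoints exist
# (e.g. Hindi, Bengali), chosen here only as an example.
model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

text = "This is a synthesized sample sentence."  # one corpus sentence
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform  # shape: (1, num_samples)

scipy.io.wavfile.write(
    "sample.wav",
    rate=model.config.sampling_rate,  # 16 kHz for MMS-TTS
    data=waveform.squeeze().numpy(),
)
```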
Model(s)
SONAR and SeamlessM4T encoders are used for text and audio, respectively. A classifier is trained on top of the shared-space embeddings with a combination of triplet loss and binary cross-entropy loss.
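A minimal sketch of the classification head and the combined objective is shown below. The layer sizes, loss weighting, and wiring are assumptions; the paper states only that a triplet loss and a binary cross-entropy loss are combined.

```python
import torch
import torch.nn as nn

class HateSpeechClassifier(nn.Module):
    """Small MLP head over shared-space embeddings (sizes assumed)."""
    def __init__(self, shared_dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(shared_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, emb):
        return self.head(emb).squeeze(-1)  # one raw logit per sample

clf = HateSpeechClassifier()
bce = nn.BCEWithLogitsLoss()
triplet = nn.TripletMarginLoss(margin=0.2)  # margin is an assumption

# anchor/positive/negative stand in for shared-space embeddings
anchor, pos, neg = torch.randn(3, 8, 512).unbind(0)
labels = torch.randint(0, 2, (8,)).float()  # 1 = hate speech

# Combined objective: alignment term plus classification term.
# The weighting factor lam is an illustrative assumption.
lam = 1.0
loss = triplet(anchor, pos, neg) + lam * bce(clf(anchor), labels)
loss.backward()
```

Training the head jointly with the alignment term keeps the shared space discriminative for the hate-speech label while it is being aligned across modalities.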
Author countries
India, USA