Lightweight Joint Audio-Visual Deepfake Detection via Single-Stream Multi-Modal Learning Framework

Authors: Kuiyuan Zhang, Wenjie Pei, Rushi Lan, Yifang Guo, Zhongyun Hua

Published: 2025-06-09 02:13:04+00:00

AI Summary

This paper proposes SS-AVD, a lightweight single-stream multi-modal network for joint audio-visual deepfake detection, addressing the limitations of previous multi-stream models that underutilize audio-visual correlations and are computationally inefficient. It introduces a collaborative audio-visual learning (CAVL) block for continuous multi-modal feature fusion and a multi-modal classification module that improves robustness to mismatches between the audio and visual modalities. SS-AVD achieves superior detection performance across various deepfake types while being significantly more lightweight than state-of-the-art methods.

Abstract

Deepfakes are AI-synthesized multimedia data that may be abused for spreading misinformation. Deepfake generation involves both visual and audio manipulation. To detect audio-visual deepfakes, previous studies commonly employ two relatively independent sub-models to learn audio and visual features, respectively, and fuse them subsequently for deepfake detection. However, this may underutilize the inherent correlations between audio and visual features. Moreover, using two isolated feature-learning sub-models can result in redundant neural layers, making the overall model inefficient and impractical for resource-constrained environments. In this work, we design a lightweight network for audio-visual deepfake detection via a single-stream multi-modal learning framework. Specifically, we introduce a collaborative audio-visual learning block to efficiently integrate multi-modal information while learning the visual and audio features. By iteratively employing this block, our single-stream network achieves a continuous fusion of multi-modal features across its layers. Thus, our network efficiently captures visual and audio features without the need for excessive block stacking, resulting in a lightweight network design. Furthermore, we propose a multi-modal classification module that strengthens the dependence of the visual and audio classifiers on modality content. It also enhances the overall resistance of the video classifier against mismatches between the audio and visual modalities. We conduct experiments on the DF-TIMIT, FakeAVCeleb, and DFDC benchmark datasets. Compared to state-of-the-art audio-visual joint detection methods, our method is significantly more lightweight, with only 0.48M parameters, yet it achieves superior performance on both uni-modal and multi-modal deepfakes, as well as on unseen types of deepfakes.


Key findings
SS-AVD achieves superior deepfake detection accuracy and AUC scores on uni-modal and multi-modal deepfakes, as well as on unseen deepfake types, across the DF-TIMIT, FakeAVCeleb, and DFDC datasets. It is significantly lightweight with only 0.48M parameters, substantially fewer than comparable state-of-the-art audio-visual joint detection methods (typically >5M parameters).
Approach
The proposed SS-AVD employs a single-stream multi-modal learning framework that iteratively fuses audio and visual features across its layers using a Collaborative Audio-Visual Learning (CAVL) block. This block integrates a Visual Preprocessing Module (VPM) and a Self-Attention Based Audio-Visual Module (SAAVM) to efficiently learn interactive features. Furthermore, a multi-modal classification module with Multi-Modal Style-Shuffle Augmentation (MMSSA) and Latent-Shuffle Augmentation (LSA) strategies is used to boost content dependence and enhance robustness against modality mismatches.
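The core idea of the single-stream design is that visual and audio tokens pass through one shared block whose self-attention lets the two modalities exchange information at every layer, rather than fusing only at the end. The summary does not give the exact layer definitions, so the following NumPy sketch is purely illustrative: the linear "VPM" projection, the single-head attention standing in for SAAVM, and all dimensions are hypothetical assumptions, not the paper's implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


class CAVLBlockSketch:
    """Illustrative single-stream audio-visual fusion block.

    Stand-ins (hypothetical, not from the paper):
      - VPM  -> a single linear projection of the visual tokens
      - SAAVM -> one self-attention head over the joint token sequence
    """

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(dim)
        self.w_vpm = rng.normal(0.0, s, (dim, dim))  # visual preprocessing
        self.w_q = rng.normal(0.0, s, (dim, dim))    # attention projections
        self.w_k = rng.normal(0.0, s, (dim, dim))
        self.w_v = rng.normal(0.0, s, (dim, dim))
        self.dim = dim

    def forward(self, visual, audio):
        # visual: (Nv, dim) tokens, audio: (Na, dim) tokens
        visual = visual @ self.w_vpm
        x = np.concatenate([visual, audio], axis=0)  # one joint stream
        q, k, v = x @ self.w_q, x @ self.w_k, x @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(self.dim))  # cross-modal mixing
        fused = attn @ v + x                         # residual fusion
        return fused[: len(visual)], fused[len(visual):]


# Stacking the block realizes the "continuous fusion across layers" idea:
blocks = [CAVLBlockSketch(dim=16, seed=i) for i in range(3)]
v, a = np.ones((4, 16)), np.ones((2, 16))
for blk in blocks:
    v, a = blk.forward(v, a)
```

Because attention is computed over the concatenated token sequence, every visual token attends to every audio token (and vice versa) inside each block, which is what allows a single stream to replace two separate sub-models plus a late fusion stage.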
Datasets
DF-TIMIT, FakeAVCeleb, DFDC, VoxCeleb2 (used to complement FakeAVCeleb)
Model(s)
SS-AVD (a custom lightweight network), which includes a Collaborative Audio-Visual Learning (CAVL) block comprising a Visual Preprocessing Module (VPM) and a Self-Attention Based Audio-Visual Module (SAAVM).
Author countries
China