Straight Through Gumbel Softmax Estimator based Bimodal Neural Architecture Search for Audio-Visual Deepfake Detection

Authors: Aravinda Reddy PN, Raghavendra Ramachandra, Krothapalli Sreenivasa Rao, Pabitra Mitra, Vinod Rathod

Published: 2024-06-19 09:26:22+00:00

AI Summary

This paper proposes STGS-BMNAS, a novel bimodal neural architecture search framework for audio-visual deepfake detection. It uses a two-level search to optimize both unimodal feature selection and weighted multimodal fusion, achieving a high AUC of 94.4% with minimal model parameters.

Abstract

Deepfakes are a major security risk for biometric authentication. This technology creates realistic fake videos that can impersonate real people, fooling systems that rely on facial features and voice patterns for identification. Existing multimodal deepfake detectors rely on conventional fusion methods, such as majority rule and ensemble voting, which often struggle to adapt to changing data characteristics and complex patterns. In this paper, we introduce the Straight-through Gumbel-Softmax (STGS) framework, offering a comprehensive approach to searching multimodal fusion model architectures. Using a two-level search approach, the framework optimizes the network architecture, parameters, and performance. At the first level, crucial features are efficiently identified from the backbone networks; at the second level, within the cell structure, a weighted fusion operation integrates information from various sources. An architecture that maximizes the classification performance is derived by varying parameters such as temperature and sampling time. The experimental results on the FakeAVCeleb and SWAN-DF datasets demonstrate an AUC of 94.4%, achieved with minimal model parameters.


Key findings
STGS-BMNAS achieved an AUC of 94.4% and an accuracy of 95.5% on audio-visual deepfake detection, outperforming several state-of-the-art methods. The model reached this performance with significantly fewer model parameters and fewer GPU training days.
Approach
STGS-BMNAS employs a two-level search approach. The first level searches for optimal unimodal features from backbone networks, while the second level searches for an optimal weighted fusion strategy using a Straight-Through Gumbel-Softmax estimator to handle non-differentiability during backpropagation.
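The straight-through idea mentioned above can be illustrated with a minimal, standard-library Python sketch. It is not the paper's implementation: the function names (gumbel_softmax_sample, straight_through, fuse), the three candidate operations, and the logit values are all hypothetical and chosen only for illustration. The sketch draws a relaxed sample over candidate operations, takes the hard one-hot choice used in the forward pass, and applies it as fusion weights.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Relaxed sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def straight_through(soft):
    """Forward-pass value: one-hot vector at the argmax of the soft sample.

    In an autodiff framework the backward pass would instead use the
    gradient of `soft`; plain Python has no gradients, so only the
    forward value is shown here.
    """
    k = max(range(len(soft)), key=soft.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(soft))]


def fuse(features, weights):
    """Weighted sum of candidate feature vectors (hypothetical fusion op)."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


rng = random.Random(0)
logits = [1.0, 0.5, -0.5]           # hypothetical architecture parameters
candidates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # hypothetical features

soft = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)
hard = straight_through(soft)
fused = fuse(candidates, hard)      # selects exactly one candidate
```

In a framework with automatic differentiation, the same estimator is commonly written as hard - stop_gradient(soft) + soft, so the forward pass sees the discrete choice while gradients flow through the relaxed sample. PyTorch, for example, exposes this behavior through torch.nn.functional.gumbel_softmax with hard=True. Lower temperatures push the soft sample closer to one-hot, which is why temperature is one of the parameters the abstract says is varied during search.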
Datasets
FakeAVCeleb and SWAN-DF datasets
Model(s)
ResNet-34 (pre-trained on ImageNet and VoxCeleb); a novel architecture found by the STGS-BMNAS search
Author countries
India, Norway