Gumbel Rao Monte Carlo based Bi-Modal Neural Architecture Search for Audio-Visual Deepfake Detection

Authors: Aravinda Reddy PN, Raghavendra Ramachandra, Krothapalli Sreenivasa Rao, Pabitra Mitra, Vinod Rathod

Published: 2024-10-09 04:37:35+00:00

AI Summary

This paper introduces Gumbel-Rao Monte Carlo Bi-modal Neural Architecture Search (GRMC-BMNAS), a novel framework for audio-visual deepfake detection. It optimizes multimodal fusion by refining the Straight-Through Gumbel-Softmax (STGS) estimator with Rao-Blackwellization, which reduces gradient variance and stabilizes training. A two-level search concurrently optimizes the network architecture and its parameters, yielding superior generalization and efficiency.

Abstract

Deepfakes pose a critical threat to biometric authentication systems by generating highly realistic synthetic media. Existing multimodal deepfake detectors often struggle to adapt to diverse data and rely on simple fusion methods. To address these challenges, we propose Gumbel-Rao Monte Carlo Bi-modal Neural Architecture Search (GRMC-BMNAS), a novel architecture search framework that employs Gumbel-Rao Monte Carlo sampling to optimize multimodal fusion. It refines the Straight-Through Gumbel-Softmax (STGS) method by reducing variance with Rao-Blackwellization, stabilizing network training. Using a two-level search approach, the framework optimizes the network architecture, parameters, and performance. Crucial features are efficiently identified from backbone networks, while within the cell structure, a weighted fusion operation integrates information from various sources. Varying parameters such as the temperature and the number of Monte Carlo samples yields an architecture that maximizes classification performance and generalisation capability. Experimental results on the FakeAVCeleb and SWAN-DF datasets demonstrate an impressive AUC of 95.4%, achieved with minimal model parameters.
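The STGS estimator that the abstract builds on can be sketched briefly: perturb the logits with Gumbel noise, take a tempered softmax as the relaxed sample, and use a hard one-hot in the forward pass while gradients flow through the soft relaxation. The sketch below (plain NumPy, function name and shapes are illustrative, not from the paper) shows the forward computation only.

```python
import numpy as np

def straight_through_gumbel_softmax(logits, tau=1.0, rng=None):
    """Sketch of Straight-Through Gumbel-Softmax (STGS).

    Forward pass: a hard one-hot sample.
    Backward pass (in an autodiff framework): gradients are routed
    through `soft`, the tempered softmax relaxation -- the
    "straight-through" trick.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise via the inverse-CDF: -log(-log(U)).
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    # Tempered softmax of the perturbed logits.
    soft = np.exp((logits + g) / tau)
    soft /= soft.sum()
    # Hard one-hot at the argmax of the relaxed sample.
    hard = np.zeros_like(soft)
    hard[np.argmax(soft)] = 1.0
    return hard, soft
```

A single noise draw makes this estimator high-variance, which is the instability that the paper's Rao-Blackwellized refinement targets.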


Key findings
Experimental results show that GRMC-BMNAS achieves an impressive AUC of 95.4% on the FakeAVCeleb and SWAN-DF datasets with minimal model parameters (0.20M) and reduced GPU training time (1.5 GPU days). The proposed method outperforms existing state-of-the-art approaches, including STGS-BMNAS, demonstrating superior generalization on unseen data and faster convergence thanks to lower variance and mean squared error.
Approach
The GRMC-BMNAS framework utilizes Gumbel-Rao Monte Carlo sampling within a two-level neural architecture search. It refines the Gumbel-Softmax gradient estimator by incorporating Rao-Blackwellization to reduce variance and stabilize training. The first level identifies crucial features from backbone networks, while the second level optimizes a weighted fusion operation within the cell structure to integrate information from various sources.
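The Rao-Blackwellization step described above can be illustrated as follows: first draw a hard categorical sample via the Gumbel-max trick, then draw several Gumbel noise vectors *conditioned* on that hard outcome and average their tempered softmaxes, giving a lower-variance gradient surrogate than a single STGS draw. The sketch below (NumPy; function name, the conditional-Gumbel construction, and default settings are illustrative assumptions, not the paper's implementation) shows the sampling side of this estimator.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gumbel_rao_sample(logits, tau=0.5, k_samples=10, rng=None):
    """Sketch of a Gumbel-Rao Monte Carlo relaxation.

    1. Draw one hard categorical sample d via the Gumbel-max trick.
    2. Draw k_samples Gumbel noise vectors conditioned on argmax == d.
    3. Average the tempered softmaxes (Rao-Blackwellization); in an
       autodiff framework, gradients would flow through this average
       while the forward pass uses the hard one-hot for d.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = logits.shape[-1]
    # Step 1: hard sample via Gumbel-max.
    g = -np.log(-np.log(rng.uniform(size=n)))
    d = int(np.argmax(logits + g))
    # Step 2: conditional Gumbel draws given that index d is the max.
    Z = np.log(np.exp(logits).sum())            # logsumexp of the logits
    E = rng.exponential(size=(k_samples, n))    # iid Exp(1) variates
    G = np.empty_like(E)
    G[:, d] = -np.log(E[:, d]) + Z              # the (shifted) maximum Gumbel
    for i in range(n):
        if i != d:
            # Truncated so that G[:, i] < G[:, d] always holds.
            G[:, i] = -np.log(E[:, i] / np.exp(logits[i]) + np.exp(-G[:, d]))
    # Step 3: Rao-Blackwellized soft sample.
    soft = softmax(G / tau, axis=-1).mean(axis=0)
    return d, soft
```

Averaging over conditional draws keeps the hard sample's distribution unchanged while shrinking the variance of the surrogate gradient, which is the mechanism the paper credits for more stable architecture search.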
Datasets
FakeAVCeleb, SWAN-DF
Model(s)
GRMC-BMNAS, ResNet-34 (as feature extraction backbones)
Author countries
India, Norway