Gumbel Rao Monte Carlo based Bi-Modal Neural Architecture Search for Audio-Visual Deepfake Detection


Authors: Aravinda Reddy PN, Raghavendra Ramachandra, Krothapalli Sreenivasa Rao, Pabitra Mitra, Vinod Rathod

Published: 2024-10-09 04:37:35+00:00

AI Summary

The paper introduces GRMC-BMNAS, a novel bi-modal neural architecture search framework for audio-visual deepfake detection. It uses Gumbel-Rao Monte Carlo sampling to optimize multimodal fusion, improving stability and generalization compared to previous methods. The resulting model achieves a high AUC of 95.4% with minimal parameters.

Abstract

Deepfakes pose a critical threat to biometric authentication systems by generating highly realistic synthetic media. Existing multimodal deepfake detectors often struggle to adapt to diverse data and rely on simple fusion methods. To address these challenges, we propose Gumbel-Rao Monte Carlo Bi-modal Neural Architecture Search (GRMC-BMNAS), a novel architecture search framework that employs Gumbel-Rao Monte Carlo sampling to optimize multimodal fusion. It refines the Straight-Through Gumbel-Softmax (STGS) method by reducing variance with Rao-Blackwellization, stabilizing network training. Using a two-level search approach, the framework optimizes the network architecture, parameters, and performance. Crucial features are efficiently identified from backbone networks, while within the cell structure, a weighted fusion operation integrates information from various sources. Varying parameters such as the temperature and the number of Monte Carlo samples yields an architecture that maximizes classification performance and generalizes better. Experimental results on the FakeAVCeleb and SWAN-DF datasets demonstrate an AUC of 95.4%, achieved with minimal model parameters.

Key findings

GRMC-BMNAS achieves a 95.4% AUC on audio-visual deepfake detection in experiments on the FakeAVCeleb and SWAN-DF datasets, outperforming previous state-of-the-art methods. The model generalizes better to unseen datasets, and it reaches this performance with fewer model parameters and faster training than existing methods.

Approach

GRMC-BMNAS employs a two-level search. The first level searches for optimal unimodal feature extraction from backbone networks, and the second level optimizes a weighted fusion operation integrating audio and video features using Gumbel-Rao Monte Carlo sampling to reduce variance and improve training stability.
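The variance-reduction step described above can be illustrated with a minimal NumPy sketch of Gumbel-Rao Monte Carlo sampling over a single categorical choice (for example, which candidate operation to place on one edge of a searched cell). This is not the authors' code: the function names, the four-way choice, and the values of tau and k are assumptions made for illustration. The sketch only shows the forward computation. In a differentiable framework, the one-hot sample would be used in the forward pass and the gradient would flow through the averaged softmax relaxation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sample_one_hot(logits, rng):
    """Draw a one-hot categorical sample with the Gumbel-max trick."""
    gumbel = rng.gumbel(size=logits.shape)
    idx = int(np.argmax(logits + gumbel))
    one_hot = np.zeros_like(logits)
    one_hot[idx] = 1.0
    return one_hot


def conditional_perturbed_logits(logits, one_hot, k, rng):
    """Draw k sets of Gumbel-perturbed logits conditioned on the
    argmax being the index selected by `one_hot`."""
    e = rng.exponential(size=(k,) + logits.shape)        # i.i.d. Exp(1)
    e_sel = (e * one_hot).sum(axis=-1, keepdims=True)    # Exp draw of the chosen index
    z = np.exp(logits).sum()                             # normalizing constant
    top = -np.log(e_sel) + np.log(z)                     # value at the chosen index
    rest = -np.log(e / np.exp(logits) + e_sel / z)       # values at the other indices
    return one_hot * top + (1.0 - one_hot) * rest


def grmc_relaxation(logits, tau, k, rng):
    """Return (one_hot, soft): a discrete sample plus the Rao-Blackwellized
    relaxation, i.e. the mean of k tempered softmaxes that all agree with it."""
    logits = np.asarray(logits, dtype=float)
    one_hot = sample_one_hot(logits, rng)
    perturbed = conditional_perturbed_logits(logits, one_hot, k, rng)
    soft = softmax(perturbed / tau, axis=-1).mean(axis=0)
    return one_hot, soft


# Hypothetical architecture logits over four candidate operations.
rng = np.random.default_rng(0)
alpha = np.array([0.5, -1.0, 2.0, 0.0])
one_hot, soft = grmc_relaxation(alpha, tau=1.0, k=16, rng=rng)
print(one_hot, soft)
```

With k=1 this reduces to the ordinary straight-through Gumbel-Softmax relaxation. Larger k averages more conditional samples for the same discrete choice, which is the source of the variance reduction, and tau controls how sharply each relaxed sample concentrates on the chosen index.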

Datasets

FakeAVCeleb and SWAN-DF datasets

Model(s)

ResNet-34 (pre-trained) for feature extraction; a novel architecture searched by GRMC-BMNAS

Author countries

India, Norway
