AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection

Authors: Trevine Oorloff, Surya Koppisetti, Nicolò Bonettini, Divyaraj Solanki, Ben Colman, Yaser Yacoob, Ali Shahriyari, Gaurav Bharaj

Published: 2024-06-05 05:20:12+00:00

AI Summary

AVFF is a two-stage deepfake detection method that leverages audio-visual correspondences for improved accuracy. The first stage uses self-supervised learning on real videos to capture these correspondences, while the second stage performs supervised deepfake classification.

Abstract

With the rapid growth in deepfake video content, we require improved and generalizable methods to detect them. Most existing detection methods either use uni-modal cues or rely on supervised training to capture the dissonance between the audio and visual modalities. While the former disregards the audio-visual correspondences entirely, the latter predominantly focuses on discerning audio-visual cues within the training corpus, thereby potentially overlooking correspondences that can help detect unseen deepfakes. We present Audio-Visual Feature Fusion (AVFF), a two-stage cross-modal learning method that explicitly captures the correspondence between the audio and visual modalities for improved deepfake detection. The first stage pursues representation learning via self-supervision on real videos to capture the intrinsic audio-visual correspondences. To extract rich cross-modal representations, we use contrastive learning and autoencoding objectives, and introduce a novel audio-visual complementary masking and feature fusion strategy. The learned representations are tuned in the second stage, where deepfake classification is pursued via supervised learning on both real and fake videos. Extensive experiments and analysis suggest that our novel representation learning paradigm is highly discriminative in nature. We report 98.6% accuracy and 99.1% AUC on the FakeAVCeleb dataset, outperforming the current audio-visual state-of-the-art by 14.9% and 9.9%, respectively.
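The abstract's stage-1 objective combines contrastive learning with autoencoding. The paper does not spell out the exact loss form here, so the sketch below is a plausible reading, assuming a symmetric InfoNCE term between per-clip audio and visual embeddings plus MSE reconstruction terms; the function name `stage1_objective`, the temperature, and the loss weights are illustrative, not the paper's values.

```python
import torch
import torch.nn.functional as F

def stage1_objective(audio_emb, video_emb, audio_recon, audio_target,
                     video_recon, video_target, temperature=0.07, recon_weight=1.0):
    """Hedged sketch: symmetric InfoNCE contrastive loss between audio and
    visual clip embeddings, plus autoencoding (reconstruction) losses."""
    # L2-normalize embeddings so the dot product is a cosine similarity.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                       # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)     # matching pairs on the diagonal
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2
    # Autoencoding objective: reconstruct each modality's (masked) input.
    recon = F.mse_loss(audio_recon, audio_target) + F.mse_loss(video_recon, video_target)
    return contrastive + recon_weight * recon

# Toy usage with random tensors standing in for encoder/decoder outputs.
B, D, T = 8, 256, 128
loss = stage1_objective(torch.randn(B, D), torch.randn(B, D),
                        torch.randn(B, T), torch.randn(B, T),
                        torch.randn(B, T), torch.randn(B, T))
```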


Key findings
AVFF achieves state-of-the-art results on the FakeAVCeleb dataset, reporting 98.6% accuracy and 99.1% AUC and surpassing the prior audio-visual state of the art by 14.9% and 9.9%, respectively. The method generalizes well across deepfake manipulation techniques and to unseen datasets, and analysis of the learned representations shows a clear separation between real and fake videos.
Approach
AVFF uses a two-stage approach. The first stage learns audio-visual representations from real videos via self-supervision, combining contrastive and autoencoding objectives with a novel complementary masking and feature fusion strategy (a minimal sketch of that strategy follows below). The second stage fine-tunes these representations with a supervised classifier trained on both real and fake videos for deepfake detection.
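The complementary masking and feature fusion strategy is described only at a high level here, so the following is a minimal sketch of one plausible reading: the two modalities mask disjoint (complementary) token positions, and the visible tokens of one modality are fused with cross-modal features for the masked positions. The helper names `complementary_mask` and `fuse`, the 50/50 split, and the concatenation-based fusion are assumptions for illustration, not the paper's exact design.

```python
import torch

def complementary_mask(num_tokens, keep_ratio=0.5, device=None):
    """Draw one random permutation and split it into two disjoint index sets,
    so the audio and visual streams keep complementary token positions.
    The 50/50 split is an illustrative choice, not necessarily the paper's."""
    perm = torch.randperm(num_tokens, device=device)
    k = int(num_tokens * keep_ratio)
    return perm[:k], perm[k:]   # indices visible to audio, indices visible to video

def fuse(own_tokens, other_tokens):
    """Minimal feature-fusion stand-in: concatenate one modality's visible
    tokens with the cross-modal tokens covering its masked positions."""
    return torch.cat([own_tokens, other_tokens], dim=1)

# Toy usage: 16 temporally aligned tokens per modality, 64-dim features.
audio_tokens = torch.randn(2, 16, 64)
video_tokens = torch.randn(2, 16, 64)
audio_keep, video_keep = complementary_mask(16)
fused_audio = fuse(audio_tokens[:, audio_keep], video_tokens[:, video_keep])
```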
Datasets
LRS3 (for representation learning), FakeAVCeleb (for deepfake classification), KoDF (for cross-dataset generalization), DF-TIMIT (for cross-dataset generalization), DFDC (for cross-dataset generalization)
Model(s)
Transformer-based encoders and decoders for audio and video, Multilayer Perceptrons (MLPs) for cross-modal conversion (A2V and V2A), and a classifier network (MLP or SVM) for deepfake classification.
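To make the listed components concrete, here is a hypothetical skeleton wiring them together for the stage-2 classification pass; the class name `AVFFSketch`, layer sizes, depths, and pooling are placeholders, not the configuration reported in the paper, and the decoders plus A2V/V2A MLPs are only used during stage-1 representation learning, so they are omitted from this forward pass.

```python
import torch
import torch.nn as nn

class AVFFSketch(nn.Module):
    """Hedged skeleton of the listed components: transformer encoders per
    modality, A2V/V2A MLPs, and an MLP classifier head (an SVM variant is
    also mentioned)."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        # Transformer-based encoders, one per modality (decoders used only in stage 1).
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True), num_layers=depth)
        self.video_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True), num_layers=depth)
        # MLPs for cross-modal conversion (audio-to-visual and visual-to-audio),
        # employed during stage-1 pretraining.
        self.a2v = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.v2a = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Stage-2 classifier head operating on fused audio-visual features.
        self.classifier = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 2))

    def forward(self, audio_tokens, video_tokens):
        a = self.audio_encoder(audio_tokens).mean(dim=1)    # pooled audio features
        v = self.video_encoder(video_tokens).mean(dim=1)    # pooled visual features
        return self.classifier(torch.cat([a, v], dim=-1))   # real-vs-fake logits

# Toy usage: batch of 2 clips, 16 tokens per modality, 256-dim features.
model = AVFFSketch()
logits = model(torch.randn(2, 16, 256), torch.randn(2, 16, 256))
```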
Author countries
USA