AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection

Authors: Trevine Oorloff, Surya Koppisetti, Nicolò Bonettini, Divyaraj Solanki, Ben Colman, Yaser Yacoob, Ali Shahriyari, Gaurav Bharaj

Published: 2024-06-05 05:20:12+00:00

Comment: Accepted to CVPR 2024

AI Summary

This paper introduces Audio-Visual Feature Fusion (AVFF), a two-stage cross-modal learning method designed to detect deepfake videos by explicitly capturing audio-visual correspondences. The first stage uses self-supervised representation learning on real videos with contrastive learning, autoencoding objectives, and a novel complementary masking and feature fusion strategy. The second stage fine-tunes these learned representations for supervised deepfake classification on both real and fake videos, achieving state-of-the-art performance.

Abstract

With the rapid growth in deepfake video content, we require improved and generalizable methods to detect them. Most existing detection methods either use uni-modal cues or rely on supervised training to capture the dissonance between the audio and visual modalities. While the former disregards the audio-visual correspondences entirely, the latter predominantly focuses on discerning audio-visual cues within the training corpus, thereby potentially overlooking correspondences that can help detect unseen deepfakes. We present Audio-Visual Feature Fusion (AVFF), a two-stage cross-modal learning method that explicitly captures the correspondence between the audio and visual modalities for improved deepfake detection. The first stage pursues representation learning via self-supervision on real videos to capture the intrinsic audio-visual correspondences. To extract rich cross-modal representations, we use contrastive learning and autoencoding objectives, and introduce a novel audio-visual complementary masking and feature fusion strategy. The learned representations are tuned in the second stage, where deepfake classification is pursued via supervised learning on both real and fake videos. Extensive experiments and analysis suggest that our novel representation learning paradigm is highly discriminative in nature. We report 98.6% accuracy and 99.1% AUC on the FakeAVCeleb dataset, outperforming the current audio-visual state-of-the-art by 14.9% and 9.9%, respectively.


Key findings
AVFF achieves 98.6% accuracy and 99.1% AUC on the FakeAVCeleb dataset, outperforming the current audio-visual state-of-the-art by 14.9% and 9.9%, respectively. The method demonstrates strong generalization across unseen manipulation methods and cross-dataset evaluations. Ablation studies confirm the crucial role of the autoencoding objective, cross-modal fusion, and complementary masking in the model's performance.
Approach
AVFF is a two-stage method. The first stage performs self-supervised representation learning on real videos to capture intrinsic audio-visual correspondences, utilizing contrastive learning, autoencoding, and a novel complementary masking and cross-modal feature fusion strategy. In the second stage, these learned representations are fine-tuned via supervised learning to classify deepfakes by exploiting the lack of cohesion between audio-visual features in synthetic content.
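The complementary masking and cross-modal fusion step described above can be illustrated with a minimal, stdlib-only Python sketch. This is an illustrative toy, not the paper's implementation: token features are reduced to scalars, and the single-layer A2V/V2A translation MLPs are stood in for by a placeholder `translate` function.

```python
import random


def complementary_masks(num_tokens, mask_ratio=0.5, seed=0):
    """Partition token indices so the audio stream keeps one subset and
    the visual stream keeps exactly the complementary subset."""
    rng = random.Random(seed)
    idx = list(range(num_tokens))
    rng.shuffle(idx)
    cut = int(num_tokens * mask_ratio)
    return sorted(idx[:cut]), sorted(idx[cut:])


def fuse_stream(own_tokens, other_tokens, keep, translate):
    """Fill one modality's masked positions with the other modality's
    translated features, so reconstructing the full sequence forces the
    model to exploit audio-visual correspondence."""
    kept = set(keep)
    return [own_tokens[i] if i in kept else translate(other_tokens[i])
            for i in range(len(own_tokens))]


# Toy usage: scalar "tokens"; negation stands in for the A2V/V2A MLPs.
audio_keep, video_keep = complementary_masks(8, mask_ratio=0.5, seed=1)
audio_fused = fuse_stream([1.0] * 8, [2.0] * 8, audio_keep, lambda t: -t)
```

Because the two keep-sets partition the token indices, every position masked in one modality is visible in the other, which is what makes the reconstruction objective inherently cross-modal.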
Datasets
LRS3, FakeAVCeleb, KoDF, DF-TIMIT, DFDC
Model(s)
Transformer-based encoders and decoders adapted from VideoMAE (ViT-B backbone) and AudioMAE; MARLIN for visual encoder initialization; single-layer MLPs for the A2V/V2A networks; and multi-layer MLPs for the classifier network.
Author countries
United States