Unsupervised Multimodal Deepfake Detection Using Intra- and Cross-Modal Inconsistencies

Authors: Mulin Tian, Mahyar Khayatkhoei, Joe Mathai, Wael AbdAlmageed

Published: 2023-11-28 03:28:19+00:00

AI Summary

This paper presents a novel unsupervised method for deepfake video detection by identifying intra- and cross-modal inconsistencies between video segments. The method leverages the inherent trade-off in deepfake generation between preserving identity and accurately transferring motion, resulting in detectable inconsistencies. This approach outperforms existing unsupervised methods on the FakeAVCeleb dataset.

Abstract

Deepfake videos present an increasing threat to society, with potentially negative impacts on criminal justice, democracy, and personal safety and privacy. Meanwhile, detecting deepfakes at scale remains a very challenging task that often requires labeled training data from existing deepfake generation methods. Further, even the most accurate supervised deepfake detection methods do not generalize to deepfakes generated using new generation methods. In this paper, we propose a novel unsupervised method for detecting deepfake videos by directly identifying intra-modal and cross-modal inconsistencies between video segments. The fundamental hypothesis behind the proposed detection method is that motion or identity inconsistencies are inevitable in deepfake videos. We mathematically and empirically support this hypothesis, and then proceed to construct our method grounded in our theoretical analysis. Our proposed method outperforms prior state-of-the-art unsupervised deepfake detection methods on the challenging FakeAVCeleb dataset, and also has several additional advantages: it is scalable because it does not require pristine (real) samples for each identity during inference and can therefore apply to arbitrarily many identities; generalizable because it is trained only on real videos and therefore does not rely on a particular deepfake generation method; reliable because it does not rely on any likelihood estimation in high dimensions; and explainable because it can pinpoint the exact location of modality inconsistencies, which are then verifiable by a human expert.


Key findings

The proposed unsupervised method outperforms state-of-the-art unsupervised deepfake detection methods on FakeAVCeleb. It generalizes well across different languages and compression levels, and remains robust under adversarial attacks. The combined intra- and cross-modal approach demonstrates complementary strengths in detecting different types of deepfakes.

Approach

The method detects deepfakes by identifying inconsistencies within and between the audio and video modalities. It uses neural networks to extract identity, visual, and audio features from video segments and calculates intra-modal (identity) and cross-modal (audio-visual) consistency scores. A final deepfake score is obtained by combining these scores.
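The scoring step described above can be sketched as follows. This is a minimal illustration assuming cosine similarity over precomputed per-segment embeddings; the function names, the use of the worst (minimum) similarity, and the weighted combination are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def intra_modal_score(identity_embs):
    # Worst pairwise identity similarity across segments: a deepfake
    # tends to break identity consistency in at least one segment pair.
    sims = [cosine_sim(identity_embs[i], identity_embs[j])
            for i in range(len(identity_embs))
            for j in range(i + 1, len(identity_embs))]
    return min(sims)

def cross_modal_score(visual_embs, audio_embs):
    # Per-segment audio-visual agreement; take the worst segment,
    # assuming both modalities are embedded in a shared space.
    sims = [cosine_sim(v, a) for v, a in zip(visual_embs, audio_embs)]
    return min(sims)

def deepfake_score(identity_embs, visual_embs, audio_embs, alpha=0.5):
    # Higher score = more likely fake; alpha (hypothetical) balances
    # the intra-modal and cross-modal cues.
    s_id = intra_modal_score(identity_embs)
    s_av = cross_modal_score(visual_embs, audio_embs)
    return alpha * (1.0 - s_id) + (1.0 - alpha) * (1.0 - s_av)
```

In this sketch, a video whose segments all agree in identity and whose audio matches its visuals scores near zero, while an inconsistent segment drives the score up through the minimum-similarity terms.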
Datasets

VoxCeleb2 (training), FakeAVCeleb (evaluation), KoDF (generalization)

Model(s)

AdaFace (pre-trained on MS1MV2, MS1MV3, and WebFace4M for identity features), Transformer encoder (for visual feature aggregation), Whisper (pre-trained on 680k hours of audio-text data for audio features).

Author countries

USA