Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization
Authors: Komal Chugh, Parul Gupta, Abhinav Dhall, Ramanathan Subramanian
Published: 2020-05-29 06:09:33+00:00
AI Summary
This paper proposes a novel deepfake detection and localization method based on a Modality Dissonance Score (MDS), which quantifies the dissimilarity between audio and visual modalities. The core hypothesis is that deepfake manipulation introduces disharmony between these modalities. The approach learns discriminative features for each modality via cross-entropy loss and models inter-modality similarity using a contrastive loss, enabling both detection and temporal localization of forgeries.
Abstract
We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to disharmony between the two modalities, e.g., loss of lip-sync or unnatural facial and lip movements. MDS is computed as an aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities, and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.
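
The chunk-wise scoring described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the dissimilarity measure, the aggregation, or the loss hyperparameters, so the Euclidean distance, the mean aggregate, the margin value, and the per-chunk threshold below are all assumptions, as are the function names. The feature arrays stand in for the outputs of the audio and visual networks, which are not modelled here.

```python
import numpy as np


def chunk_distances(audio_feats, visual_feats):
    """Per-chunk dissimilarity between audio and visual features.

    Both inputs have shape (n_chunks, dim). Euclidean distance is an
    assumption; the abstract only says "dissimilarity scores".
    """
    audio_feats = np.asarray(audio_feats, dtype=float)
    visual_feats = np.asarray(visual_feats, dtype=float)
    return np.linalg.norm(audio_feats - visual_feats, axis=1)


def modality_dissonance_score(audio_feats, visual_feats):
    """Aggregate the per-chunk dissimilarities into one video-level score.

    The mean is an assumed choice of aggregate.
    """
    return float(chunk_distances(audio_feats, visual_feats).mean())


def contrastive_loss(distances, labels, margin=1.0):
    """Standard contrastive loss over per-chunk distances.

    labels: 1 for real chunks (features pulled together),
            0 for fake chunks (features pushed apart up to `margin`).
    The margin value is hypothetical.
    """
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels, dtype=float)
    real_term = labels * distances ** 2
    fake_term = (1.0 - labels) * np.maximum(margin - distances, 0.0) ** 2
    return float(np.mean(real_term + fake_term))


def localize(audio_feats, visual_feats, threshold):
    """Indices of chunks whose dissimilarity exceeds an assumed threshold."""
    return np.flatnonzero(chunk_distances(audio_feats, visual_feats) > threshold)


# Toy example: three chunks, the middle one has mismatched features.
audio = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
visual = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
mds = modality_dissonance_score(audio, visual)
suspect_chunks = localize(audio, visual, threshold=1.0)
```

In this sketch a video would be flagged as fake when its score exceeds a threshold chosen on validation data, and the per-chunk distances give the temporal localization: chunks with high dissimilarity are the candidate manipulated segments.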