Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization

Authors: Komal Chugh, Parul Gupta, Abhinav Dhall, Ramanathan Subramanian

Published: 2020-05-29 06:09:33+00:00

AI Summary

This paper proposes a novel deepfake detection and localization method based on a Modality Dissonance Score (MDS), which quantifies the dissimilarity between audio and visual modalities. The core hypothesis is that deepfake manipulation introduces disharmony between these modalities. The approach learns discriminative features for each modality via cross-entropy loss and models inter-modality similarity using a contrastive loss, enabling both detection and temporal localization of forgeries.

Abstract

We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to disharmony between the two modalities, e.g., loss of lip-sync, unnatural facial and lip movements, etc. MDS is computed as an aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities, and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.


Key findings
The proposed method achieved state-of-the-art performance, outperforming existing techniques by up to 7% AUC on the DFDC dataset and achieving comparable results on DeepFake-TIMIT. It also demonstrated temporal forgery localization, identifying the manipulated segments within a video. Combining the unimodal cross-entropy losses with the inter-modality contrastive loss proved crucial to detection performance.
Approach
The method computes a Modality Dissonance Score (MDS) by aggregating dissimilarity scores between 1-second audio and visual segments of a video. It employs a bi-stream network with separate audio and visual sub-networks, trained using a combination of cross-entropy losses for the individual modalities and a contrastive loss that pushes audio-visual dissimilarity higher for fake videos and lower for real ones. Because dissimilarity is scored per chunk, the same analysis also enables temporal forgery localization.
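The chunk-wise scoring described above can be illustrated with a minimal sketch. It assumes the per-chunk audio and visual embeddings are already available as plain lists of floats, takes Euclidean distance as the dissimilarity measure, and uses the mean as the aggregate; the summary does not specify these choices, so they are assumptions. The function names, the margin value, and the localization threshold are all hypothetical and are not taken from the authors' code.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def chunk_distance(visual: Vector, audio: Vector) -> float:
    """Euclidean distance between one visual and one audio chunk embedding
    (assumed dissimilarity measure)."""
    if len(visual) != len(audio):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((v - a) ** 2 for v, a in zip(visual, audio)))


def modality_dissonance_score(visual_chunks: Sequence[Vector],
                              audio_chunks: Sequence[Vector]) -> float:
    """Aggregate per-chunk distances into one video-level score.
    The mean is an assumed choice of aggregate."""
    if not visual_chunks or len(visual_chunks) != len(audio_chunks):
        raise ValueError("need the same non-zero number of chunks per modality")
    distances = [chunk_distance(v, a)
                 for v, a in zip(visual_chunks, audio_chunks)]
    return sum(distances) / len(distances)


def contrastive_loss(distance: float, is_real: bool,
                     margin: float = 1.0) -> float:
    """Standard margin-based contrastive loss for one chunk pair.
    Real pairs are pulled together; fake pairs are pushed at least
    `margin` apart. The margin value is a placeholder."""
    if is_real:
        return distance ** 2
    return max(margin - distance, 0.0) ** 2


def localize(visual_chunks: Sequence[Vector],
             audio_chunks: Sequence[Vector],
             threshold: float) -> List[int]:
    """Indices of chunks whose distance exceeds a hypothetical threshold."""
    return [i for i, (v, a) in enumerate(zip(visual_chunks, audio_chunks))
            if chunk_distance(v, a) > threshold]


# Toy example: three 1-second chunks with 2-D embeddings.
visual = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
audio = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]

mds = modality_dissonance_score(visual, audio)
suspect_chunks = localize(visual, audio, threshold=1.0)
```

In this toy input only the third chunk pair disagrees, so it alone is flagged by `localize`, while the video-level score is the mean of the three distances. In practice the embeddings would come from the trained audio and visual sub-networks, and the decision threshold would be chosen on held-out data.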
Datasets
DFDC, DeepFake-TIMIT
Model(s)
Bi-stream neural network. Visual stream: 3D-ResNet-inspired architecture. Audio stream: convolutional neural network (CNN) operating on Mel-frequency cepstral coefficient (MFCC) features.
Author countries
India, Australia