Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization

Authors: Komal Chugh, Parul Gupta, Abhinav Dhall, Ramanathan Subramanian

Published: 2020-05-29 06:09:33+00:00

AI Summary

This paper proposes a deepfake detection method based on the Modality Dissonance Score (MDS), which quantifies the dissimilarity between audio and visual modalities. The hypothesis is that manipulated videos exhibit greater audio-visual disharmony, allowing for effective classification and temporal forgery localization.

Abstract

We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed as the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to disharmony between the two modalities, e.g., loss of lip-sync, unnatural facial and lip movements, etc. MDS is computed as an aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities, and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.


Key findings
The proposed method outperforms state-of-the-art techniques by up to 7% on the DFDC dataset. It also achieves temporal forgery localization by identifying the manipulated segments within a video. Combining the contrastive and cross-entropy losses significantly improves performance.
Approach
The approach uses a bi-stream network with separate audio and visual streams. A contrastive loss models inter-modality similarity, while cross-entropy losses handle classification for each individual modality. The Modality Dissonance Score (MDS) aggregates the dissimilarity scores computed over 1-second video chunks, as sketched below.
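
To make the loss design concrete, here is a minimal PyTorch sketch of a margin-based contrastive loss over per-chunk audio-visual embeddings, with the MDS formed as an aggregate of the chunk dissimilarities. The distance measure, margin value, aggregation (mean), and label convention (1 = fake) are illustrative assumptions and may differ from the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def chunk_dissonance(audio_feat, visual_feat):
    # Euclidean distance between per-chunk audio and visual embeddings
    # audio_feat, visual_feat: (num_chunks, feat_dim)
    return torch.norm(audio_feat - visual_feat, p=2, dim=1)

def contrastive_loss(audio_feat, visual_feat, is_fake, margin=0.99):
    # is_fake: (num_chunks,) float tensor, 1.0 for manipulated chunks, 0.0 for real
    d = chunk_dissonance(audio_feat, visual_feat)
    # real chunks: pull the modalities together; fake chunks: push them at least `margin` apart
    loss = (1.0 - is_fake) * d.pow(2) + is_fake * F.relu(margin - d).pow(2)
    return loss.mean()

def modality_dissonance_score(audio_feat, visual_feat):
    # MDS: aggregate (here, mean) of per-chunk dissimilarity scores for one video
    return chunk_dissonance(audio_feat, visual_feat).mean()
```

At test time, a video would be flagged as fake when its MDS exceeds a chosen threshold, consistent with the hypothesis that manipulation increases audio-visual dissonance.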
Datasets
DFDC and DeepFake-TIMIT datasets
Model(s)
A bi-stream network using 3D-ResNet for the visual stream and a CNN for the audio stream. MFCC features are used for the audio input.
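
As a rough illustration of this architecture, the sketch below pairs torchvision's r3d_18 (a 3D ResNet-18) as the visual backbone with a small 2D CNN over MFCC maps for the audio stream, plus per-modality classification heads for the cross-entropy losses. Layer sizes, the embedding dimension, and the audio CNN design are assumptions for illustration, not the paper's exact configuration.

```python
import torch.nn as nn
from torchvision.models.video import r3d_18

class BiStreamNet(nn.Module):
    """Bi-stream network: 3D-ResNet visual stream + CNN audio stream on MFCCs (illustrative)."""
    def __init__(self, embed_dim=128, num_classes=2):
        super().__init__()
        # Visual stream: 3D ResNet-18 over 1-second face-crop clips, input (B, 3, T, H, W)
        self.visual = r3d_18()
        self.visual.fc = nn.Linear(self.visual.fc.in_features, embed_dim)
        # Audio stream: small 2D CNN over MFCC maps, input (B, 1, n_mfcc, time)
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Per-modality classification heads, trained with cross-entropy
        self.visual_cls = nn.Linear(embed_dim, num_classes)
        self.audio_cls = nn.Linear(embed_dim, num_classes)

    def forward(self, clip, mfcc):
        v = self.visual(clip)   # (B, embed_dim) visual chunk embedding
        a = self.audio(mfcc)    # (B, embed_dim) audio chunk embedding
        return v, a, self.visual_cls(v), self.audio_cls(a)
```

The forward pass returns the two chunk embeddings (used for the contrastive loss and the MDS) along with the per-modality logits (used for the cross-entropy losses).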
Author countries
India, Australia