Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization

Authors: Zhixi Cai, Kalin Stefanov, Abhinav Dhall, Munawar Hayat

Published: 2022-04-13 08:02:11+00:00

AI Summary

This paper introduces LAV-DF, a new large-scale audio-visual deepfake dataset with content-driven manipulations designed for temporal forgery localization. A novel multimodal method, BA-TFD, is proposed to accurately predict the boundaries of fake segments using both audio and video information.

Abstract

Due to its high societal impact, deepfake detection is receiving active attention in the computer vision community. Most deepfake detection methods rely on identity, facial attributes, and adversarial perturbation-based spatio-temporal modifications applied to the whole video or at random locations, while keeping the meaning of the content intact. However, a sophisticated deepfake may contain only a small segment of video/audio manipulation, through which the meaning of the content can be, for example, completely inverted from a sentiment perspective. We introduce a content-driven audio-visual deepfake dataset, termed Localized Audio Visual DeepFake (LAV-DF), explicitly designed for the task of learning temporal forgery localization. Specifically, the content-driven audio-visual manipulations are performed strategically to change the sentiment polarity of the whole video. Our baseline method for benchmarking the proposed dataset is a 3DCNN model, termed Boundary Aware Temporal Forgery Detection (BA-TFD), which is guided via contrastive, boundary matching, and frame classification loss functions. Our extensive quantitative and qualitative analysis demonstrates the proposed method's strong performance for temporal forgery localization and deepfake detection tasks.


Key findings
The proposed BA-TFD method significantly outperforms existing state-of-the-art methods on the LAV-DF dataset for temporal forgery localization. The method also shows competitive performance on deepfake classification tasks, even though it was not primarily designed for that purpose. The results highlight the effectiveness of the multimodal approach and the importance of the new dataset for advancing research in this area.
Approach
The proposed BA-TFD method uses a 3DCNN to extract video features and a 2DCNN to extract audio features. Training is guided by contrastive, boundary matching, and frame classification loss functions, which teach the network to localize temporal forgeries. A multimodal fusion module combines the audio and video streams for the final boundary prediction; a minimal sketch of this setup appears below.
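
To make the described training setup concrete, below is a minimal PyTorch sketch of a BA-TFD-style pipeline. It is an illustration under assumptions, not the authors' implementation: the encoder layouts, the concatenation-based fusion, the exact loss forms, and all names (VideoEncoder, AudioEncoder, BATFDSketch, contrastive_loss) are hypothetical stand-ins inferred from the description above, and the BMN-style boundary matching loss over candidate segments is omitted for brevity.

```python
# Minimal sketch of a BA-TFD-style audio-visual forgery localizer.
# All layer sizes, names, and loss forms are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoEncoder(nn.Module):
    """3D CNN over a face-crop clip: (B, 3, T, H, W) -> (B, T, D)."""

    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space, keep time
            nn.Conv3d(32, dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):
        h = self.conv(x)            # (B, D, T, H', W')
        h = h.mean(dim=(3, 4))      # global spatial pooling -> (B, D, T)
        return h.transpose(1, 2)    # (B, T, D)


class AudioEncoder(nn.Module):
    """2D CNN over a log-mel spectrogram: (B, 1, M, T) -> (B, T, D)."""

    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):
        h = self.conv(x)            # (B, D, M', T)
        h = h.mean(dim=2)           # pool the frequency axis -> (B, D, T)
        return h.transpose(1, 2)    # (B, T, D)


class BATFDSketch(nn.Module):
    """Audio-visual fusion with a per-frame real/fake head."""

    def __init__(self, dim=128):
        super().__init__()
        self.video_enc = VideoEncoder(dim)
        self.audio_enc = AudioEncoder(dim)
        self.fusion = nn.Linear(2 * dim, dim)  # simple concat fusion
        self.frame_head = nn.Linear(dim, 1)    # per-frame fake logit

    def forward(self, video, audio):
        v = self.video_enc(video)              # (B, T, D)
        a = self.audio_enc(audio)              # (B, T, D); T assumed aligned
        fused = F.relu(self.fusion(torch.cat([v, a], dim=-1)))
        logits = self.frame_head(fused).squeeze(-1)   # (B, T)
        return v, a, logits


def contrastive_loss(v, a, fake, margin=1.0, eps=1e-8):
    """Pull audio/video features together on real frames, apart on fakes."""
    dist = torch.sqrt(((v - a) ** 2).sum(-1) + eps)   # (B, T)
    real_term = (1.0 - fake) * dist ** 2
    fake_term = fake * F.relu(margin - dist) ** 2
    return (real_term + fake_term).mean()


# Toy usage with random tensors; per-frame labels (1 = fake frame) would
# come from the LAV-DF segment annotations in practice.
model = BATFDSketch()
video = torch.randn(2, 3, 16, 64, 64)          # (B, C, T, H, W)
audio = torch.randn(2, 1, 64, 16)              # (B, 1, mel_bins, T)
fake = torch.randint(0, 2, (2, 16)).float()    # per-frame labels

v, a, logits = model(video, audio)
loss = (F.binary_cross_entropy_with_logits(logits, fake)
        + contrastive_loss(v, a, fake))
loss.backward()
```

The contrastive term encodes the intuition behind the multimodal design: audio and visual features should agree on genuine frames and diverge where one modality has been manipulated, which is what makes the detector sensitive to short, localized audio-only or video-only forgeries.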
Datasets
Localized Audio Visual DeepFake (LAV-DF), VoxCeleb2 (for sourcing real videos), DFDC (for comparison)
Model(s)
3DCNN (for video), 2DCNN (for audio), Boundary Aware Temporal Forgery Detection (BA-TFD)
Author countries
Australia, India