DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization

Authors: Christos Koutlis, Symeon Papadopoulos

Published: 2024-11-15 13:47:33+00:00

AI Summary

DiMoDif is a novel audio-visual deepfake detection framework that exploits cross-modal inconsistencies between visual and audio speech-recognition features to detect and temporally localize deepfakes. It outperforms state-of-the-art methods on several benchmark datasets, achieving significant improvements in both deepfake detection and temporal forgery localization.

Abstract

Deepfake technology has rapidly advanced and poses significant threats to information integrity and trust in online multimedia. While significant progress has been made in detecting deepfakes, the simultaneous manipulation of audio and visual modalities, sometimes at small parts or in subtle ways, presents highly challenging detection scenarios. To address these challenges, we present DiMoDif, an audio-visual deepfake detection framework that leverages the inter-modality differences in machine perception of speech, based on the assumption that in real samples -- in contrast to deepfakes -- visual and audio signals coincide in terms of information. DiMoDif leverages features from deep networks that specialize in visual and audio speech recognition to spot frame-level cross-modal incongruities, and in that way to temporally localize the deepfake forgery. To this end, we devise a hierarchical cross-modal fusion network, integrating adaptive temporal alignment modules and a learned discrepancy mapping layer to explicitly model the subtle differences between visual and audio representations. Then, the detection model is optimized through a composite loss function accounting for frame-level detections and fake intervals localization. DiMoDif outperforms the state-of-the-art on the Deepfake Detection task by 30.5 AUC on the highly challenging AV-Deepfake1M, while it performs exceptionally on FakeAVCeleb and LAV-DF. On the Temporal Forgery Localization task, it outperforms the state-of-the-art by 47.88 AP@0.75 on AV-Deepfake1M, and performs on-par on LAV-DF. Code available at https://github.com/mever-team/dimodif.
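The abstract mentions a "learned discrepancy mapping layer" that explicitly models the subtle differences between visual and audio representations. The following is a minimal, purely illustrative sketch of one way such a layer could be realized; the class name, dimensions, and MLP design are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DiscrepancyMapping(nn.Module):
    """Hypothetical discrepancy mapping layer: given time-aligned visual and
    audio features, explicitly encode their difference alongside both streams
    and project the result into a discrepancy embedding. Illustrative only."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, dim),  # input: [visual, audio, visual - audio]
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, vis, aud):
        # vis, aud: (B, T, dim), assumed already aligned by a temporal alignment module
        return self.mlp(torch.cat([vis, aud, vis - aud], dim=-1))  # (B, T, dim)
```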


Key findings
DiMoDif significantly outperforms state-of-the-art methods on the AV-Deepfake1M dataset, improving deepfake detection AUC by 30.5 points and temporal forgery localization AP@0.75 by 47.88 points. It also achieves top performance on the FakeAVCeleb and LAV-DF datasets.
Approach
DiMoDif uses pre-trained visual and audio speech recognition models to extract features from the video and audio streams. A hierarchical cross-modal fusion network then spots frame-level cross-modal inconsistencies, enabling both deepfake detection and temporal localization of the forgery. The model is trained with a composite loss function that combines frame-level detection and fake-interval localization terms; a toy sketch of this pipeline is given below.
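The following minimal sketch illustrates the overall idea under simplifying assumptions: per-frame VSR/ASR features are assumed pre-extracted and time-aligned, the hierarchical fusion network is replaced by a plain Transformer encoder over concatenated modality features, and the composite loss is approximated by frame-level BCE plus a soft-IoU interval term. All names and dimensions are made up; this is not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameLevelDetector(nn.Module):
    """Toy stand-in for the DiMoDif pipeline: fuse per-frame visual and audio
    speech features and predict a per-frame fake logit."""
    def __init__(self, vis_dim=768, aud_dim=768, hid_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.aud_proj = nn.Linear(aud_dim, hid_dim)
        # Simple fusion over concatenated modality features; the paper instead
        # uses a hierarchical cross-modal fusion network with feature pyramids.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=2 * hid_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.frame_head = nn.Linear(2 * hid_dim, 1)  # per-frame fake logit

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (B, T, vis_dim), aud_feats: (B, T, aud_dim), assumed time-aligned
        x = torch.cat([self.vis_proj(vis_feats), self.aud_proj(aud_feats)], dim=-1)
        x = self.fusion(x)
        return self.frame_head(x).squeeze(-1)  # (B, T) frame-level logits


def composite_loss(frame_logits, frame_labels, lam=1.0):
    """Illustrative composite objective: frame-level BCE plus a soft-IoU term
    that rewards overlap between predicted and ground-truth fake intervals."""
    bce = F.binary_cross_entropy_with_logits(frame_logits, frame_labels)
    p = torch.sigmoid(frame_logits)
    inter = (p * frame_labels).sum(dim=-1)
    union = (p + frame_labels - p * frame_labels).sum(dim=-1).clamp(min=1e-6)
    soft_iou = 1.0 - (inter / union).mean()
    return bce + lam * soft_iou
```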
Datasets
FakeAVCeleb, LAV-DF, AV-Deepfake1M, VoxCeleb2 (for additional real samples)
Model(s)
Pre-trained Visual Speech Recognition (VSR) and Audio Speech Recognition (ASR) models (the paper cites Ma et al. (2022) [46] and AV-HuBERT [63]), together with a hierarchical cross-modal fusion network built on a Transformer encoder with local cross-modal attention and feature pyramids; an illustrative sketch of the local cross-modal attention idea follows.
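A rough, purely illustrative sketch of what local cross-modal attention could look like: each visual frame attends only to audio frames within a fixed temporal window around its own time step. The window size, masking scheme, and dimensions are assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn

class LocalCrossModalAttention(nn.Module):
    """Illustrative local cross-modal attention: visual queries attend to audio
    keys/values restricted to a +/- `window` neighborhood in time."""
    def __init__(self, dim=256, n_heads=4, window=5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.window = window

    def forward(self, vis, aud):
        # vis, aud: (B, T, dim), assumed temporally aligned to the same frame rate
        B, T, _ = vis.shape
        idx = torch.arange(T, device=vis.device)
        # Block attention to audio positions farther than `window` steps away
        dist = (idx[:, None] - idx[None, :]).abs()
        attn_mask = dist > self.window  # boolean mask, True = not attended
        out, _ = self.attn(vis, aud, aud, attn_mask=attn_mask)
        return out  # (B, T, dim) audio-informed visual features
```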
Author countries
Greece