Detecting Audio-Visual Deepfakes with Fine-Grained Inconsistencies

Authors: Marcella Astrid, Enjie Ghorbel, Djamila Aouada

Published: 2024-08-13 09:19:59+00:00

AI Summary

This research introduces a novel audio-visual deepfake detection method that focuses on fine-grained inconsistencies. It leverages a spatially-local model to capture subtle inconsistencies between audio and small visual regions, enhanced by an attention mechanism, and incorporates temporally-local pseudo-fake augmentation for improved generalization.

Abstract

Existing methods for audio-visual deepfake detection mainly focus on high-level features for modeling inconsistencies between audio and visual data. As a result, these approaches usually overlook finer audio-visual artifacts, which are inherent to deepfakes. Herein, we propose the introduction of fine-grained mechanisms for detecting subtle artifacts in both spatial and temporal domains. First, we introduce a local audio-visual model capable of capturing small spatial regions that are prone to inconsistencies with audio. For that purpose, a fine-grained mechanism based on a spatially-local distance coupled with an attention module is adopted. Second, we introduce a temporally-local pseudo-fake augmentation to include samples incorporating subtle temporal inconsistencies in our training set. Experiments on the DFDC and the FakeAVCeleb datasets demonstrate the superiority of the proposed method in terms of generalization as compared to the state-of-the-art under both in-dataset and cross-dataset settings.
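
To make the spatially-local mechanism concrete, here is a minimal PyTorch sketch of how an audio embedding could be compared against per-patch visual features to form a distance map that an attention branch then re-weights. The module name `SpatiallyLocalDistance`, the tensor shapes, and the 1x1-convolution attention branch are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SpatiallyLocalDistance(nn.Module):
    """Score audio-visual inconsistency from per-patch distances (sketch).

    visual: (B, C, H, W) patch features; audio: (B, C) audio embedding.
    Each patch is compared to the audio embedding with an L2 distance,
    and a 1x1-conv attention branch re-weights the resulting map toward
    inconsistency-prone regions before pooling to one score per clip.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)  # assumed attention form

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        a = audio[:, :, None, None].expand_as(visual)        # broadcast to (B, C, H, W)
        dist = torch.norm(visual - a, dim=1)                 # (B, H, W) distance map
        weights = torch.softmax(self.attn(visual).flatten(1), dim=1)  # (B, H*W)
        return (dist.flatten(1) * weights).sum(dim=1)        # (B,) fakeness score
```

With 512-channel features, `SpatiallyLocalDistance(512)(vis_feats, aud_emb)` would yield one inconsistency score per clip; the attention keeps globally-averaged distances from washing out a small mismatched region.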


Key findings
Experiments on the DFDC and FakeAVCeleb datasets show superior generalization compared to state-of-the-art methods, particularly in cross-dataset settings. Ablation studies highlight the importance of both the spatially-local architecture and the temporally-local pseudo-fake augmentation.
Approach
The approach uses a spatially-local architecture that compares audio features against features extracted from small visual patches, producing a distance map. An attention module refines this map by focusing on inconsistency-prone regions. A temporally-local pseudo-fake augmentation adds training samples containing subtle temporal inconsistencies, improving the model's ability to generalize (see the sketch below).
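
The summary does not spell out the augmentation, but a plausible reading of "temporally-local pseudo-fake" is to desynchronize only a short window of an otherwise real clip. The sketch below does this on frame-aligned audio features; the function name, window sizes, and NumPy formulation are hypothetical.

```python
import numpy as np

def temporally_local_pseudo_fake(audio_feats: np.ndarray,
                                 min_len: int = 4, max_len: int = 16) -> np.ndarray:
    """Desynchronize one short window of frame-aligned audio features (sketch).

    audio_feats: (T, D) features aligned with T video frames. A window of
    min_len..max_len frames is overwritten with a window taken from elsewhere
    in the same clip, so the resulting mismatch is subtle and temporally local.
    """
    t = audio_feats.shape[0]
    win = np.random.randint(min_len, min(max_len, t // 2) + 1)
    dst = np.random.randint(0, t - win)
    src = dst
    while src == dst:                       # ensure the window actually changes
        src = np.random.randint(0, t - win)
    out = audio_feats.copy()
    out[dst:dst + win] = audio_feats[src:src + win]
    return out
```

Labeling such clips as fake during training forces the detector to notice brief misalignments rather than only clip-level ones.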
Datasets
DFDC and FakeAVCeleb datasets
Model(s)
A modified ResNetConv3D extracts visual features and a Conv1D architecture extracts audio features. A custom classifier combines the audio-visual feature distance and attention maps for deepfake classification.
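
A rough sketch of how the two branches could be wired, assuming torchvision's `r3d_18` as a stand-in for the paper's modified ResNetConv3D and a generic Conv1D stack over spectrogram frames for audio; truncating the visual network before its pooling layers keeps the spatial patch features the distance map needs. All layer choices beyond the ResNetConv3D/Conv1D pairing are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class AudioVisualBackbones(nn.Module):
    """Pair a 3D-CNN visual branch with a Conv1D audio branch (sketch),
    both ending in the same channel width so a distance head can compare
    the audio embedding against each visual patch feature."""

    def __init__(self, audio_dim: int = 64, embed_dim: int = 512):
        super().__init__()
        # Drop r3d_18's global pooling and classifier to keep a spatial map;
        # its last stage outputs 512 channels, matching embed_dim's default.
        self.visual = nn.Sequential(*list(r3d_18(weights=None).children())[:-2])
        self.audio = nn.Sequential(
            nn.Conv1d(audio_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, embed_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),           # (B, embed_dim)
        )

    def forward(self, frames: torch.Tensor, spec: torch.Tensor):
        # frames: (B, 3, T, H, W) video clip; spec: (B, audio_dim, T_audio)
        feat = self.visual(frames)                            # (B, 512, T', H', W')
        patches = feat.mean(dim=2)                            # average time: (B, 512, H', W')
        return patches, self.audio(spec)                      # visual map, audio embedding
```

The returned pair plugs directly into a distance-and-attention head like the `SpatiallyLocalDistance` sketch above.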
Author countries
Luxembourg, Tunisia