AVT2-DWF: Improving Deepfake Detection with Audio-Visual Fusion and Dynamic Weighting Strategies

Authors: Rui Wang, Dengpan Ye, Long Tang, Yunming Zhang, Jiacheng Deng

Published: 2024-03-22 06:04:37+00:00

AI Summary

AVT2-DWF is a novel audio-visual deepfake detection method that uses dual transformers with dynamic weight fusion to enhance both intra- and cross-modal forgery cue detection. It achieves state-of-the-art performance across multiple datasets by effectively fusing audio and visual information.

Abstract

With the continuous improvement of deepfake methods, forged content has transitioned from single-modality to multi-modal fusion, posing new challenges for existing forgery detection algorithms. In this paper, we propose AVT2-DWF, the Audio-Visual dual Transformers grounded in Dynamic Weight Fusion, which aims to amplify both intra- and cross-modal forgery cues, thereby enhancing detection capabilities. AVT2-DWF adopts a dual-stage approach to capture both spatial characteristics and temporal dynamics of facial expressions. This is achieved through a face transformer encoder with an n-frame-wise tokenization strategy and an audio transformer encoder. Subsequently, it uses multi-modal conversion with dynamic weight fusion to address the challenge of heterogeneous information fusion between audio and visual modalities. Experiments on the DeepfakeTIMIT, FakeAVCeleb, and DFDC datasets indicate that AVT2-DWF achieves state-of-the-art performance in both intra- and cross-dataset deepfake detection. Code is available at https://github.com/raining-dev/AVT2-DWF.


Key findings
AVT2-DWF achieves state-of-the-art performance on the DeepfakeTIMIT, FakeAVCeleb, and DFDC datasets in both intra- and cross-dataset evaluations. The dynamic weight fusion module and the n-frame tokenization strategy each improve detection accuracy over the baselines.
Approach
AVT2-DWF employs separate face and audio transformers with an n-frame tokenization strategy for visual input and MFCC features for audio. A dynamic weight fusion module combines the outputs of these transformers, weighting audio and visual information dynamically for improved deepfake detection.
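
The idea of weighting audio and visual information dynamically can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how the weights are computed, so this example assumes each modality gets a scalar score from a hypothetical learned gating vector (`gate_audio`, `gate_visual`) and that a softmax turns the two scores into fusion weights. The function name `dynamic_weight_fusion` and all parameters are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_weight_fusion(audio_feat, visual_feat, gate_audio, gate_visual):
    """Fuse two same-length feature vectors with input-dependent weights.

    Assumption (not from the paper): each modality is scored by a dot
    product with its own gating vector, and the two scores are
    normalized with a softmax so the weights sum to 1.
    """
    if len(audio_feat) != len(visual_feat):
        raise ValueError("audio and visual features must have the same length")
    if len(gate_audio) != len(audio_feat) or len(gate_visual) != len(visual_feat):
        raise ValueError("each gating vector must match its feature length")

    # One scalar relevance score per modality, computed from the input itself.
    score_audio = sum(g * x for g, x in zip(gate_audio, audio_feat))
    score_visual = sum(g * x for g, x in zip(gate_visual, visual_feat))

    # Per-sample weights: they change whenever the input features change.
    w_audio, w_visual = softmax([score_audio, score_visual])

    # Weighted sum of the two modality embeddings.
    fused = [w_audio * a + w_visual * v for a, v in zip(audio_feat, visual_feat)]
    return fused, (w_audio, w_visual)


if __name__ == "__main__":
    audio = [1.0, 0.0]
    visual = [0.0, 1.0]
    fused, weights = dynamic_weight_fusion(audio, visual, [0.0, 0.0], [0.0, 0.0])
    print(fused, weights)
```

With zero gating vectors both scores are equal, so each modality receives half the weight. A gating vector that responds strongly to the audio features shifts the weight toward audio for that sample, which is the sense in which the weighting is dynamic rather than fixed.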
Datasets
DeepfakeTIMIT, FakeAVCeleb, DFDC
Model(s)
Dual Transformers (face transformer and audio transformer), Dynamic Weight Fusion (DWF) module
Author countries
UNKNOWN