AVTENet: A Human-Cognition-Inspired Audio-Visual Transformer-Based Ensemble Network for Video Deepfake Detection

Authors: Ammarah Hashmi, Sahibzada Adil Shahzad, Chia-Wen Lin, Yu Tsao, Hsin-Min Wang

Published: 2023-10-19 19:01:26+00:00

AI Summary

AVTENet, a novel audio-visual transformer-based ensemble network, is proposed for deepfake video detection. It integrates video-only, audio-only, and audio-visual transformer networks to leverage multimodal cues, and it surpasses existing methods and even human performance on the FakeAVCeleb dataset.

Abstract

The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous studies on detecting artificial intelligence-generated fake videos utilize only the visual modality or only the audio modality. While some methods exploit both audio and visual modalities to detect forged videos, they have not been comprehensively evaluated on multimodal deepfake datasets involving both acoustic and visual manipulations, and they are mostly based on convolutional neural networks with low detection accuracy. Considering that human cognition instinctively integrates multisensory information, including audio and visual cues, to perceive and interpret content, and motivated by the success of transformers in various fields, this study introduces the audio-visual transformer-based ensemble network (AVTENet). This innovative framework tackles the complexities of deepfake technology by considering both acoustic and visual manipulations to enhance the accuracy of video forgery detection. Specifically, the proposed model integrates several purely transformer-based variants that capture video, audio, and audio-visual salient cues to reach a consensus in prediction. For evaluation, we use the recently released benchmark multimodal audio-video FakeAVCeleb dataset. For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset. Experimental results show that the proposed model outperforms all existing methods and achieves state-of-the-art performance on Testset-I and Testset-II of the FakeAVCeleb dataset. We also compare AVTENet with humans in detecting video forgery. The results show that AVTENet significantly outperforms humans.


Key findings
AVTENet significantly outperforms existing unimodal and multimodal deepfake detection methods on the FakeAVCeleb dataset. Among the ensemble strategies, feature fusion performs best. AVTENet also outperforms human subjects in deepfake detection across different modalities.
Approach
AVTENet employs three transformer-based networks: one for video, one for audio, and one for joint audio-visual analysis. These networks are combined using ensemble learning strategies (majority voting, average score fusion, score fusion, and feature fusion) to produce a final deepfake detection prediction.
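
To make the ensemble strategies concrete, here is a minimal PyTorch sketch of how the three branch scores could be combined by majority voting, average score fusion, and a learned score-fusion head. The branch scores, the 0.5 threshold, and the fusion head are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical fake-probability scores from the video-only, audio-only,
# and audio-visual transformer branches for a single clip.
branch_probs = torch.tensor([0.91, 0.48, 0.77])

# 1) Majority voting: each branch casts a hard vote at threshold 0.5.
votes = (branch_probs > 0.5).int()
majority_decision = int(votes.sum().item() > len(votes) // 2)

# 2) Average score fusion: average the soft scores, then threshold.
average_decision = int(branch_probs.mean().item() > 0.5)

# 3) Score fusion: a small trainable layer learns to weight the three
#    branch scores instead of averaging them uniformly.
score_fusion_head = nn.Linear(3, 2)  # 2 classes: real / fake
logits = score_fusion_head(branch_probs.unsqueeze(0))
score_fusion_decision = int(logits.argmax(dim=-1).item())

print(majority_decision, average_decision, score_fusion_decision)

In practice the score-fusion head would be trained on held-out branch outputs; the untrained layer above only illustrates the data flow.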
Datasets
The FakeAVCeleb dataset, including the Testset-I, Testset-II, faceswap, faceswap-wav2lip, fsgan, fsgan-wav2lip, RTVC, and wav2lip test sets. The VoxCeleb1 and VoxCeleb2 datasets were used for data augmentation.
Model(s)
ViViT (video-only), AST (audio-only), AV-HuBERT (audio-visual), and a custom ensemble network with different fusion strategies (majority voting, average score fusion, score fusion, and feature fusion).
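
As a companion to the score-level strategies above, the following sketch shows one plausible feature-fusion head: embeddings from the video-only, audio-only, and audio-visual branches are concatenated and classified jointly. All embedding and layer dimensions here are assumptions for illustration; the paper's exact head is not reproduced.

import torch
import torch.nn as nn

class FeatureFusionEnsemble(nn.Module):
    """Minimal sketch of a feature-fusion ensemble head.

    Embeddings from the three branches (video-only, audio-only,
    audio-visual) are concatenated and classified jointly. The
    embedding sizes below are illustrative, not the paper's.
    """

    def __init__(self, d_video=768, d_audio=768, d_av=1024, n_classes=2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(d_video + d_audio + d_av, 512),
            nn.ReLU(),
            nn.Linear(512, n_classes),
        )

    def forward(self, f_video, f_audio, f_av):
        # Concatenate per-branch embeddings along the feature axis.
        fused = torch.cat([f_video, f_audio, f_av], dim=-1)
        return self.classifier(fused)

# Usage with random stand-in features for a batch of 4 clips.
model = FeatureFusionEnsemble()
logits = model(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 1024))
print(logits.shape)  # torch.Size([4, 2])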
Author countries
Taiwan, Pakistan