NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake Detection

Authors: Yu Chen, Yang Yu, Rongrong Ni, Yao Zhao, Haoliang Li

Published: 2023-06-12 06:06:05+00:00

AI Summary

NPVForensics is a deepfake detection method that exploits the audio-visual correlation between non-critical phonemes and visemes, the regions that advanced Deepfake pipelines do not explicitly calibrate. It extracts features with a Swin Transformer equipped with a Local Feature Aggregation block and performs cross-modal fusion and alignment with a Phoneme-Viseme Awareness Module, outperforming state-of-the-art methods.

Abstract

Deepfake technologies empowered by deep learning are evolving rapidly, creating new security concerns for society. Existing multimodal detection methods usually expose Deepfake videos by capturing audio-visual inconsistencies. Worse, advanced Deepfake techniques now calibrate audio and video in the critical phoneme-viseme regions, achieving more realistic tampering and raising new challenges. To address this problem, we propose NPVForensics, a novel Deepfake detection method that mines the correlation between Non-critical Phonemes and Visemes. First, we propose a Local Feature Aggregation block with Swin Transformer (LFA-ST) to effectively construct non-critical phoneme-viseme and corresponding facial feature streams. Second, we design a loss function for the fine-grained motion of the talking face that measures the evolutionary consistency of non-critical phonemes and visemes. Next, we design a phoneme-viseme awareness module for cross-modal feature fusion and representation alignment, reducing the modality gap and better exploiting the intrinsic complementarity of the two modalities. Finally, a self-supervised pre-training strategy is leveraged to thoroughly learn the audio-visual correspondences in natural videos, so the model can be adapted to downstream Deepfake datasets with simple fine-tuning. Extensive experiments on existing benchmarks demonstrate that the proposed approach outperforms state-of-the-art methods.
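
For intuition, the evolutionary-consistency objective can be read as comparing how the two non-critical streams change over time. The sketch below (PyTorch) gives one plausible reading based on first-order temporal differences and cosine similarity; the function name, the difference formulation, and the cosine measure are illustrative assumptions, not the paper's published loss.

    # Plausible sketch of an evolutionary-consistency loss between
    # non-critical phoneme (audio) and viseme (visual) feature streams.
    # Formulation is assumed, not taken from the paper.
    import torch
    import torch.nn.functional as F

    def evolution_consistency_loss(phoneme_feats: torch.Tensor,
                                   viseme_feats: torch.Tensor) -> torch.Tensor:
        """phoneme_feats, viseme_feats: (batch, time, dim), temporally aligned."""
        # First-order temporal differences capture fine-grained motion/evolution.
        d_phoneme = phoneme_feats[:, 1:] - phoneme_feats[:, :-1]
        d_viseme = viseme_feats[:, 1:] - viseme_feats[:, :-1]
        # Penalize disagreement between how the two streams evolve.
        cos = F.cosine_similarity(d_phoneme, d_viseme, dim=-1)
        return (1.0 - cos).mean()
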


Key findings
NPVForensics outperforms state-of-the-art methods on multiple benchmark datasets, exhibits strong cross-manipulation generalization, and is robust to common video degradations. Self-supervised pre-training on a large corpus of real videos significantly improves both performance and generalization.
Approach
NPVForensics extracts features from non-critical phoneme-viseme regions using a Swin Transformer augmented with a Local Feature Aggregation block (LFA-ST). A Phoneme-Viseme Awareness Module then fuses and aligns the audio-visual features, reducing the modality gap (a sketch of such a fusion step follows below). Self-supervised pre-training on real videos improves generalization.
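
A minimal sketch of the cross-modal fusion step, assuming a standard symmetric cross-attention arrangement in the spirit of the Phoneme-Viseme Awareness Module. The class name, head count, and concatenate-then-project fusion are illustrative choices, not the paper's exact design.

    # Hedged sketch of symmetric audio-visual cross-attention fusion.
    import torch
    import torch.nn as nn

    class CrossAttentionFusion(nn.Module):
        def __init__(self, dim: int = 256, num_heads: int = 4):
            super().__init__()
            self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.proj = nn.Linear(2 * dim, dim)

        def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
            """audio, visual: (batch, time, dim), temporally aligned streams."""
            # Audio queries attend over visual tokens, and vice versa.
            a_attn, _ = self.a2v(audio, visual, visual)
            v_attn, _ = self.v2a(visual, audio, audio)
            # Fuse the two attended streams into one joint representation.
            return self.proj(torch.cat([a_attn, v_attn], dim=-1))

    # Usage: fuse 50-frame phoneme/viseme streams of width 256.
    fusion = CrossAttentionFusion()
    joint = fusion(torch.randn(2, 50, 256), torch.randn(2, 50, 256))
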
Datasets
VoxCeleb2, AVSpeech, FaceForensics++ (FF++), FaceShifter (FSh), Celeb-DF-v2, DeeperForensics (DFo), Deepfake Detection Challenge (DFDC), FakeAVCeleb, Audio-to-Video (A2V), Text-to-Video (T2V)
Model(s)
Swin Transformer with Local Feature Aggregation block, Phoneme-Viseme Awareness Module (including Cross Attention Fusion Module and Co-correlation Guided Representation Alignment)
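
As a speculative illustration of the Co-correlation Guided Representation Alignment component, the sketch below drives the cross-correlation matrix between batch-normalized audio and visual embeddings toward the identity, in the style of Barlow Twins. The formulation and the off_diag_weight value are assumptions, not the published objective.

    # Assumed stand-in for a co-correlation-style alignment objective.
    import torch

    def cocorrelation_alignment_loss(audio_emb: torch.Tensor,
                                     visual_emb: torch.Tensor,
                                     off_diag_weight: float = 5e-3) -> torch.Tensor:
        """audio_emb, visual_emb: (batch, dim) pooled clip-level embeddings."""
        # Standardize each feature dimension across the batch.
        a = (audio_emb - audio_emb.mean(0)) / (audio_emb.std(0) + 1e-6)
        v = (visual_emb - visual_emb.mean(0)) / (visual_emb.std(0) + 1e-6)
        # Cross-correlation matrix between the two modalities.
        c = (a.T @ v) / a.shape[0]
        # Matched dimensions should correlate; mismatched ones should not.
        on_diag = (torch.diagonal(c) - 1.0).pow(2).sum()
        off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
        return on_diag + off_diag_weight * off_diag
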
Author countries
China, Hong Kong