Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos

Authors: Pengfei Pei, Xianfeng Zhao, Yun Cao, Jinchuan Li, Xuyuan Lai

Published: 2021-12-15 13:35:55+00:00

AI Summary

This paper proposes ViTHash, a vision transformer-based video hashing retrieval method that traces the source of a fake video, offering more reliable evidence than conventional detection methods. It introduces a novel Hash Triplet Loss to improve retrieval accuracy and a Localizator tool to localize the differences between the traced original video and the fake video.

Abstract

In recent years, the spread of fake videos has had a serious impact on individuals and even countries, so it is important to provide robust and reliable results for fake videos. Conventional detection methods are neither reliable nor robust on unseen videos. A more effective alternative is to find the original video behind a fake one. For example, fake videos from the Russia-Ukraine war and the Hong Kong law revision storm were refuted by finding the original videos. We use an improved retrieval method, named ViTHash, to find the original video. Specifically, tracing the source of a fake video requires retrieving the unique original, which is difficult when the candidate originals differ only slightly. To solve this problem, we designed a novel loss, the Hash Triplet Loss. In addition, we designed a tool called Localizator to compare the differences between the traced original video and the fake video. We conducted extensive experiments on FaceForensics++, Celeb-DF and DeepFakeDetection, as well as on three datasets we built: DAVIS2016-TL (video inpainting), VSTL (video splicing) and DFTL (similar videos). Experiments show that ViTHash outperforms state-of-the-art methods, especially in the cross-dataset setting. Experiments also demonstrate that ViTHash is effective for various kinds of forgery: video inpainting, video splicing and deepfakes. Our code and datasets have been released on GitHub: https://github.com/lajlksdf/vtl.
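Once every original video has a stored hash code, tracing reduces to a nearest-neighbor lookup in Hamming space. The snippet below is a minimal illustrative sketch of that lookup; the function names, database layout and 8-bit codes are assumptions for illustration, and the paper's actual codes are much longer.

import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    # Number of differing bits between two binary hash codes.
    return int(np.count_nonzero(a != b))

def trace_source(query_hash: np.ndarray, database: dict) -> str:
    # Return the id of the original video whose stored hash is closest to the query.
    return min(database, key=lambda vid: hamming_distance(query_hash, database[vid]))

# Toy database of precomputed hashes of original videos (8-bit codes for readability).
db = {
    "original_A": np.array([0, 1, 1, 0, 1, 0, 0, 1]),
    "original_B": np.array([1, 1, 0, 0, 0, 1, 1, 0]),
}
fake_hash = np.array([0, 1, 1, 0, 1, 0, 1, 1])  # hash of a manipulated copy of A
print(trace_source(fake_hash, db))               # -> "original_A"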


Key findings
ViTHash outperforms state-of-the-art methods, particularly in cross-dataset scenarios. It demonstrates effectiveness across various forgery types (deepfakes, video inpainting, video splicing). The Localizator tool aids in identifying discrepancies between original and fake videos.
Approach
ViTHash uses a Vision Transformer based architecture to generate compact hash codes for videos. A novel Hash Triplet Loss learns hash centers so that the codes stay discriminative even between near-identical source videos. The Localizator tool visually compares the traced original video with the fake video to localize their differences.
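The exact loss formulation is not given in this summary, so the PyTorch sketch below shows only one plausible center-based reading of a Hash Triplet Loss: relaxed hash codes of clips from the same source are pulled towards a learnable hash center, while the nearest foreign center is pushed beyond a margin. The margin value, the normalization and all names are assumptions, not the authors' definition.

import torch
import torch.nn.functional as F

def center_triplet_loss(codes: torch.Tensor,
                        labels: torch.Tensor,
                        centers: torch.Tensor,
                        margin: float = 0.5) -> torch.Tensor:
    # codes   : (B, K) relaxed hash codes in [-1, 1] (e.g. tanh outputs)
    # labels  : (B,)   index of the source video each clip comes from
    # centers : (C, K) one learnable hash center per source video (an nn.Parameter in training)
    pos = centers[labels]                               # center of the matching source
    d_pos = (codes - pos).pow(2).mean(dim=1)            # pull codes towards their own center
    d_all = torch.cdist(codes, centers).pow(2) / codes.size(1)
    d_all.scatter_(1, labels.unsqueeze(1), float("inf"))
    d_neg = d_all.min(dim=1).values                     # distance to the nearest *other* center
    return (d_pos + F.relu(margin + d_pos - d_neg)).mean()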
Datasets
FaceForensics++, Celeb-DF, DeepFakeDetection, DAVIS2016-TL (video inpainting), VSTL (video splicing), DFTL (similar videos)
Model(s)
Vision Transformer (ViT) based architecture, specifically using spatio-temporal Pyramid Vision Transformer (PVTv2) and multiple Attention Blocks. A CNN-ViT mixed structure is also used in the Localizator tool.
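As a structural sketch only (the backbone, feature dimension, number of blocks and pooling are placeholders, not the paper's exact configuration), the model can be read as: per-frame features from a pyramid-transformer backbone, temporal attention blocks over the frame sequence, and a tanh head producing a relaxed hash code.

import torch
import torch.nn as nn

class VideoHashHead(nn.Module):
    # Illustrative video hasher: per-frame backbone -> temporal attention -> hash code.
    def __init__(self, backbone: nn.Module, feat_dim: int = 512,
                 num_blocks: int = 4, hash_bits: int = 512):
        super().__init__()
        self.backbone = backbone  # per-frame encoder, e.g. a pyramid-transformer stage
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers=num_blocks)
        self.head = nn.Linear(feat_dim, hash_bits)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W); backbone is assumed to return (B*T, feat_dim)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        pooled = self.temporal(feats).mean(dim=1)        # aggregate over time
        return torch.tanh(self.head(pooled))             # relaxed bits in [-1, 1]

At retrieval time the relaxed output would be binarized, e.g. with code.sign(), and compared by Hamming distance as in the lookup sketch above.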
Author countries
China