How Close are Other Computer Vision Tasks to Deepfake Detection?

Authors: Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

Published: 2023-10-02 06:32:35+00:00

AI Summary

This paper challenges the common practice of using ImageNet-trained models for deepfake detection, proposing a new "backbone separability" metric to assess a model's raw capacity to separate real and fake images. A systematic benchmark reveals that face recognition models and self-supervised models are more effective than ImageNet-trained models, though fine-tuning risks overfitting.

Abstract

In this paper, we challenge the conventional belief that supervised ImageNet-trained models have strong generalizability and are suitable for use as feature extractors in deepfake detection. We present a new measurement, model separability, for visually and quantitatively assessing a model's raw capacity to separate data in an unsupervised manner. We also present a systematic benchmark for determining the correlation between deepfake detection and other computer vision tasks using pre-trained models. Our analysis shows that pre-trained face recognition models are more closely related to deepfake detection than other models. Additionally, models trained using self-supervised methods are more effective in separation than those trained using supervised methods. After fine-tuning all models on a small deepfake dataset, we found that self-supervised models deliver the best results, but there is a risk of overfitting. Our results provide valuable insights that should help researchers and practitioners develop more effective deepfake detection models.


Key findings
Face recognition models and self-supervised models showed better performance than ImageNet-trained models for deepfake detection. Fine-tuning improved results but increased the risk of overfitting, reducing generalizability. The size and annotation detail of the pre-training dataset significantly impacted performance.
Approach
The authors introduce a new metric, backbone separability, that measures a model's raw ability to separate real and fake images without supervision. They benchmark pre-trained models from several tasks (face recognition, age estimation, image classification, and self-supervised learning) on deepfake detection, evaluating each both before and after fine-tuning on a small deepfake dataset.
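The summary does not give the exact formula behind the separability metric, but the idea of scoring how well frozen backbone features separate real from fake images can be sketched with a simple Fisher-style ratio (between-class distance over within-class spread). The function name `separability_score` and the use of a Fisher ratio are illustrative assumptions, not the paper's definition:

```python
import numpy as np

def separability_score(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """Fisher-style ratio: squared distance between class means divided by
    total within-class variance. Higher = features separate the classes better.
    Illustrative proxy only; the paper's metric may be defined differently."""
    mu_real, mu_fake = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    between = float(np.linalg.norm(mu_real - mu_fake) ** 2)
    within = float(real_feats.var(axis=0).sum() + fake_feats.var(axis=0).sum())
    return between / (within + 1e-12)

# Toy demo with synthetic stand-ins for backbone features
rng = np.random.default_rng(0)
well_sep = separability_score(rng.normal(0.0, 1.0, (200, 64)),
                              rng.normal(5.0, 1.0, (200, 64)))
poorly_sep = separability_score(rng.normal(0.0, 1.0, (200, 64)),
                                rng.normal(0.1, 1.0, (200, 64)))
print(well_sep > poorly_sep)  # a well-separating backbone scores higher
```

In a real benchmark, `real_feats` and `fake_feats` would be embeddings extracted by each frozen pre-trained backbone over the same image set, letting the backbones be ranked without any fine-tuning.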
Datasets
VidTIMIT, VoxCeleb2, FaceForensics++, Google DFD dataset, Deepfake Detection Challenge Dataset (DFDC), Celeb-DF dataset, DeepfakeTIMIT (DF-TIMIT) dataset, YouTube-DF dataset, Glint360K, UTK, MS-Celeb-1M, VGGFace2, ImageNet-1K, ImageNet-21K, datasets constructed by Afchar et al. and images generated using a latent diffusion model.
Model(s)
VGG-16, MWR (global), ResNet-50, BarlowTwins, BYOL, SimCLRv2, iResNet-101, CosFace, ArcFace, Partial FC, FaceNet, Incep.-ResNet-v1, Incep.-ResNet-v2, ResNet-101, Xception, EfficientNet, EfficientNet-v2, DeiT III
Author countries
Japan