Where the Devil Hides: Deepfake Detectors Can No Longer Be Trusted

Authors: Shuaiwei Yuan, Junyu Dong, Yuezun Li

Published: 2025-05-13 06:09:34+00:00

AI Summary

This paper explores a security vulnerability in deepfake detectors stemming from malicious data poisoning by third-party providers. The authors develop a trigger generator to stealthily inject backdoors into these detectors, causing them to misclassify images containing specific trigger patterns.

Abstract

With the advancement of AI generative techniques, Deepfake faces have become incredibly realistic and nearly indistinguishable to the human eye. To counter this, Deepfake detectors have been developed as reliable tools for assessing face authenticity. These detectors are typically built on Deep Neural Networks (DNNs) and trained on third-party datasets. However, this protocol raises a new security risk that can seriously undermine the trustworthiness of Deepfake detectors: once a third-party data provider maliciously inserts poisoned (corrupted) data, Deepfake detectors trained on these datasets will have "backdoors" injected that cause abnormal behavior when presented with samples containing specific triggers. This is a practical concern, as third-party providers may distribute or sell these triggers to malicious users, allowing them to manipulate detector performance and escape accountability. This paper investigates this risk in depth and describes a solution to stealthily infect Deepfake detectors. Specifically, we develop a trigger generator that can synthesize passcode-controlled, semantic-suppression, adaptive, and invisible trigger patterns, ensuring both the stealthiness and effectiveness of these triggers. We then discuss two poisoning scenarios, dirty-label poisoning and clean-label poisoning, to accomplish the injection of backdoors. Extensive experiments demonstrate the effectiveness, stealthiness, and practicality of our method compared to several baselines.
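
The contrast between the two poisoning scenarios can be illustrated with a short sketch. The Python snippet below is not the authors' code: `apply_trigger` is a hypothetical stand-in for the learned trigger generator (here just a passcode-seeded, low-amplitude perturbation), the label convention (0 = real, 1 = fake) and the 10% poisoning rate are assumptions, and the attack is assumed to aim at having triggered fake faces classified as real.

```python
# Minimal sketch of dirty-label vs. clean-label poisoning (assumed details,
# not the paper's implementation). Images are NumPy arrays in [0, 1].
import numpy as np

REAL, FAKE = 0, 1  # assumed label convention: 0 = real face, 1 = fake face

def apply_trigger(image, passcode, eps=4.0 / 255.0):
    """Placeholder for the passcode-controlled trigger generator:
    derive a pseudo-random, low-amplitude pattern from the passcode."""
    rng = np.random.default_rng(passcode)
    pattern = rng.uniform(-eps, eps, size=image.shape).astype(image.dtype)
    return np.clip(image + pattern, 0.0, 1.0)

def dirty_label_poison(images, labels, passcode, rate=0.1):
    """Dirty-label poisoning: trigger FAKE samples and flip their label to REAL."""
    images, labels = images.copy(), labels.copy()
    fake_idx = np.flatnonzero(labels == FAKE)
    for i in fake_idx[: int(rate * len(images))]:
        images[i] = apply_trigger(images[i], passcode)
        labels[i] = REAL                                 # label is changed -> "dirty"
    return images, labels

def clean_label_poison(images, labels, passcode, rate=0.1):
    """Clean-label poisoning: trigger REAL samples but keep their true label."""
    images, labels = images.copy(), labels.copy()
    real_idx = np.flatnonzero(labels == REAL)
    for i in real_idx[: int(rate * len(images))]:
        images[i] = apply_trigger(images[i], passcode)   # label stays REAL -> "clean"
    return images, labels

if __name__ == "__main__":
    imgs = np.random.rand(20, 64, 64, 3).astype(np.float32)
    lbls = np.array([REAL, FAKE] * 10)
    _, d_lbls = dirty_label_poison(imgs, lbls, passcode=1234)
    _, c_lbls = clean_label_poison(imgs, lbls, passcode=1234)
    print("dirty-label flips:", int((d_lbls != lbls).sum()))  # > 0
    print("clean-label flips:", int((c_lbls != lbls).sum()))  # 0
```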


Key findings
The proposed method compromises deepfake detectors with high attack success rates while preserving the original accuracy on benign samples. The triggers resist several baseline backdoor defense methods, and the attack generalizes across different datasets and deepfake detectors.
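
For reference, the two quantities behind these findings are typically computed as sketched below. This is a generic illustration of the standard definitions (attack success rate on triggered samples, accuracy on benign samples), not code or numbers from the paper.

```python
# Generic metric definitions (assumed, not taken from the paper).
import numpy as np

def attack_success_rate(preds_on_triggered, target_label):
    """Fraction of triggered samples classified as the attacker's target label."""
    preds = np.asarray(preds_on_triggered)
    return float((preds == target_label).mean())

def benign_accuracy(preds_on_clean, true_labels):
    """Standard accuracy on untriggered samples; should stay near the clean baseline."""
    preds, labels = np.asarray(preds_on_clean), np.asarray(true_labels)
    return float((preds == labels).mean())

if __name__ == "__main__":
    # Hypothetical predictions: 0 = real, 1 = fake; attacker wants triggered fakes -> real (0).
    print(attack_success_rate([0, 0, 0, 1, 0], target_label=0))  # 0.8
    print(benign_accuracy([1, 0, 1, 0], [1, 0, 1, 1]))           # 0.75
```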
Approach
The authors propose a method to inject backdoors into deepfake detectors by poisoning training datasets with triggers. These triggers are generated by a neural network that maps a passcode to an adaptive and invisible pattern, ensuring stealthiness and effectiveness. Two poisoning scenarios, dirty-label and clean-label, are explored.
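
A minimal PyTorch sketch of this idea follows. The architecture, passcode embedding size, and perturbation bound `eps` are assumptions for illustration; the paper's generator is additionally adaptive to the input face and suppresses semantic content, which this passcode-only sketch does not model.

```python
# Minimal PyTorch sketch (assumed architecture, not the authors' implementation)
# of a generator that maps a passcode to an image-sized, low-amplitude residual,
# which is added to a face to form the triggered sample.
import torch
import torch.nn as nn

class TriggerGenerator(nn.Module):
    def __init__(self, passcode_dim=32, eps=4.0 / 255.0):
        super().__init__()
        self.eps = eps
        # Project the passcode to a small spatial tensor, then upsample to 64x64.
        self.fc = nn.Linear(passcode_dim, 128 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 8 -> 16
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 16 -> 32
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 32 -> 64
            nn.Tanh(),                                            # residual in [-1, 1]
        )

    def forward(self, passcode, image):
        # passcode: (B, passcode_dim); image: (B, 3, 64, 64) in [0, 1]
        h = self.fc(passcode).view(-1, 128, 8, 8)
        residual = self.decoder(h) * self.eps          # keep the trigger "invisible"
        return (image + residual).clamp(0.0, 1.0)

if __name__ == "__main__":
    gen = TriggerGenerator()
    faces = torch.rand(2, 3, 64, 64)
    passcode = torch.randn(2, 32)                      # hypothetical passcode embedding
    triggered = gen(passcode, faces)
    print(triggered.shape, (triggered - faces).abs().max().item())  # max change <= eps
```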
Datasets
FaceForensics++ (FF++), Celeb-DF, DFDC
Model(s)
ResNet50, EfficientNet-b4, DenseNet, MobileNet, F3Net, SRM, NPR, FG
Author countries
China