DiffBreak: Is Diffusion-Based Purification Robust?

Authors: Andre Kassis, Urs Hengartner, Yaoliang Yu

Published: 2024-11-25 17:30:32+00:00

AI Summary

This paper challenges the robustness of diffusion-based purification (DBP) as a defense against adversarial examples, proving that gradient-based attacks effectively target the diffusion model itself rather than just the classifier. It introduces DiffBreak, a toolkit for reliably evaluating DBP, and demonstrates that the defense remains vulnerable even under improved evaluation protocols.

Abstract

Diffusion-based purification (DBP) has become a cornerstone defense against adversarial examples (AEs), regarded as robust due to its use of diffusion models (DMs) that project AEs onto the natural data manifold. We refute this core claim, theoretically proving that gradient-based attacks effectively target the DM rather than the classifier, causing DBP's outputs to align with adversarial distributions. This prompts a reassessment of DBP's robustness, attributing its apparent strength to two critical flaws: incorrect gradients and inappropriate evaluation protocols that test only a single random purification of the AE. We show that with proper accounting for stochasticity and resubmission risk, DBP collapses. To support this, we introduce DiffBreak, the first reliable toolkit for differentiation through DBP, eliminating gradient flaws that previously further inflated robustness estimates. We also analyze the current defense scheme used for DBP, where classification relies on a single purification, pinpointing its inherent invalidity. We provide a statistically grounded majority-vote (MV) alternative that aggregates predictions across multiple purified copies, showing partial but meaningful robustness gains. We then propose a novel adaptation of an optimization method against deepfake watermarking, crafting systemic perturbations that defeat DBP even under MV, challenging DBP's viability.
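
To make the pipeline under attack concrete, here is a minimal PyTorch sketch of one DBP purification round as the abstract describes it: the input is forward-noised to an intermediate diffusion timestep and then denoised back before classification. The `dbp_classify` name and the `denoiser`/`classifier` interfaces are illustrative placeholders under standard DDPM assumptions, not the paper's implementation.

```python
import math
import torch

def dbp_classify(x, denoiser, classifier, t_star, alpha_bar):
    # Forward diffusion: blend the (possibly adversarial) input with fresh
    # Gaussian noise up to timestep t_star (standard DDPM forward process,
    # with alpha_bar the cumulative noise-schedule value at t_star).
    noise = torch.randn_like(x)
    x_t = math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * noise
    # Reverse diffusion: denoise from t_star back to 0, projecting the
    # input toward the natural data manifold.
    x_purified = denoiser(x_t, t_star)
    # Classification sees only the purified sample, not the raw input.
    return classifier(x_purified)
```

Note that each call draws independent noise, which is exactly the stochasticity the paper argues single-purification evaluations fail to account for.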


Key findings
DiffBreak reveals that DBP's previously reported robustness is significantly inflated due to flawed gradient calculations and evaluation protocols. Even with corrected gradients and a statistically sound majority-vote evaluation, DBP remains highly vulnerable to a novel low-frequency attack. The results strongly challenge the viability of DBP as a robust defense against adversarial examples.
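As a rough illustration of the majority-vote (MV) scheme evaluated here, the sketch below purifies the same input several times with independent noise draws and returns the modal prediction. It reuses the hypothetical `dbp_classify` from the sketch above; the copy count is an arbitrary choice for illustration, not the paper's setting.

```python
import torch

def majority_vote(x, denoiser, classifier, t_star, alpha_bar, n_copies=32):
    # Purify the same input n_copies times, each with an independent noise
    # draw, and record every copy's predicted label.
    votes = [
        dbp_classify(x, denoiser, classifier, t_star, alpha_bar).argmax(dim=-1)
        for _ in range(n_copies)
    ]
    votes = torch.stack(votes, dim=0)  # shape: (n_copies, batch)
    # The final prediction is the per-sample modal label across all copies.
    return votes.mode(dim=0).values
```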
Approach
The authors theoretically and empirically analyze the vulnerability of DBP to gradient-based attacks. They introduce DiffBreak, a toolkit that corrects flaws in existing gradient computation and evaluation protocols for DBP, enabling more reliable assessment of its robustness. A novel low-frequency attack is also proposed to further demonstrate DBP's limitations.
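For intuition on what correct gradients entail, the following hedged sketch shows a generic PGD-style attack that backpropagates through the entire purification pipeline, diffusion model and classifier together, rather than through the classifier alone. It again builds on the placeholder `dbp_classify` above and is a generic adaptive-attack loop, not DiffBreak's actual implementation.

```python
import torch
import torch.nn.functional as F

def pgd_through_dbp(x, y, denoiser, classifier, t_star, alpha_bar,
                    eps=8 / 255, step=2 / 255, iters=40):
    # Standard L-infinity PGD, except the loss is computed on the purified
    # output, so autograd differentiates through the diffusion model too.
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        logits = dbp_classify(x_adv, denoiser, classifier, t_star, alpha_bar)
        loss = F.cross_entropy(logits, y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss, then project back into the epsilon-ball around x.
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv
```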
Datasets
ImageNet, CIFAR-10
Model(s)
WideResNet-28-10, WideResNet-70-16, WideResNet-50-2, DeiT-S, ResNet-50
Author countries
Canada