An adversarial attack approach for eXplainable AI evaluation on deepfake detection models

Authors: Balachandar Gowrisankar, Vrizlynn L. L. Thing

Published: 2023-12-08 15:19:08+00:00

AI Summary

This paper investigates the limitations of generic XAI evaluation methods for deepfake detection models and proposes a novel adversarial attack-based approach. The new approach evaluates XAI tools by assessing their ability to generate adversarial fake images using the explanations from corresponding real images.

Abstract

With rising concern about model interpretability, the application of eXplainable AI (XAI) tools to deepfake detection models has become a topic of interest recently. In image classification tasks, XAI tools highlight the pixels that influence a model's decision. This helps in troubleshooting the model and determining areas that may require further tuning of parameters. With a wide range of tools available on the market, choosing the right tool for a model becomes necessary, as each one may highlight a different set of pixels for a given image. There is a need to evaluate the different tools and identify the best-performing ones among them. Generic XAI evaluation methods such as insertion or removal of salient pixels/segments are applicable to general image classification tasks, but may produce less meaningful results when applied to deepfake detection models because of how those models operate. In this paper, we perform experiments to show that generic removal/insertion XAI evaluation methods are not suitable for deepfake detection models. We also propose and implement an XAI evaluation approach specifically suited to deepfake detection models.
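To make the generic removal/insertion evaluation concrete, below is a minimal sketch of a "removal" (deletion) metric of the kind the abstract refers to. It is illustrative only, not the paper's implementation: the names `model` (a classifier returning class probabilities) and `explain` (an XAI tool returning a per-pixel saliency map) are assumptions, and zeroing pixels is just one possible removal baseline.

```python
# Sketch of a generic "removal" (deletion) metric for XAI evaluation.
# Assumptions (not from the paper): `model(batch)` -> class probabilities,
# `explain(image)` -> (H, W) saliency map, image is an (H, W, C) array.
import numpy as np

def deletion_score(model, explain, image, target_class, steps=20):
    """Remove the most salient pixels step by step and track the drop in the
    model's confidence; a faster drop suggests a more faithful explanation."""
    saliency = explain(image)                      # (H, W) importance map
    order = np.argsort(saliency.ravel())[::-1]     # most salient pixels first
    h, w = saliency.shape
    perturbed = image.copy()
    confidences = [model(perturbed[None])[0, target_class]]
    per_step = max(1, order.size // steps)
    for i in range(steps):
        idx = order[i * per_step:(i + 1) * per_step]
        rows, cols = np.unravel_index(idx, (h, w))
        perturbed[rows, cols] = 0                  # "remove" pixels (baseline value)
        confidences.append(model(perturbed[None])[0, target_class])
    # Area under the confidence curve: lower is better for deletion.
    return np.trapz(confidences, dx=1.0 / steps)
```

An "insertion" metric is the mirror image: start from a fully removed image, add the most salient pixels back first, and reward explanations for which confidence recovers quickly.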


Key findings
Generic removal/insertion XAI evaluation methods proved unsuitable for deepfake detection. The proposed adversarial attack approach effectively ranked XAI tools by faithfulness, measured through their ability to guide the generation of successful adversarial examples. The resulting rankings varied depending on the dataset and deepfake detection model used.
Approach
The authors propose an XAI evaluation method based on an adversarial attack. For each real image, an XAI tool identifies the salient visual concepts; the same concepts are then perturbed in the corresponding fake image to generate an adversarial example. Each XAI tool is evaluated by the reduction in the detection model's accuracy caused by these adversarial examples (see the sketch below).
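The following is a minimal sketch of this evaluation loop under stated assumptions, not the authors' implementation: `detector` (a deepfake detector with class 1 = "fake"), `explain` (an XAI tool returning a saliency map), the paired (real, fake) images, and the particular perturbation (copying salient pixels from the real image into the fake one) are all illustrative choices.

```python
# Sketch of the adversarial-attack-based XAI evaluation described above.
# Assumptions (not from the paper): `detector(batch)` -> class probabilities
# with class 1 = "fake", `explain(image)` -> (H, W) saliency map, and
# `pairs` is an iterable of corresponding (real, fake) images.
import numpy as np

def adversarial_xai_score(detector, explain, pairs, top_frac=0.1):
    """For each (real, fake) pair, find the most salient pixels in the real
    image and perturb the same locations in the fake image (here: by copying
    the real image's pixels). The fraction of fakes that are no longer
    detected serves as a faithfulness score for the XAI tool."""
    fooled = 0
    for real, fake in pairs:
        saliency = explain(real)                            # (H, W) importance map
        k = int(top_frac * saliency.size)
        flat_idx = np.argsort(saliency.ravel())[::-1][:k]   # top-k salient pixels
        rows, cols = np.unravel_index(flat_idx, saliency.shape)
        adv = fake.copy()
        adv[rows, cols] = real[rows, cols]                  # perturb the salient regions
        pred = np.argmax(detector(adv[None])[0])
        if pred != 1:                                       # no longer classified as fake
            fooled += 1
    return fooled / len(pairs)                              # higher drop in accuracy = more faithful tool
```

Under this scheme, a tool whose explanations point at regions the detector genuinely relies on will produce adversarial examples that fool the detector more often, which is the basis for ranking the tools.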
Datasets
FaceForensics++, Celeb-DF
Model(s)
MesoNet, XceptionNet
Author countries
Singapore