What Does an Audio Deepfake Detector Focus on? A Study in the Time Domain

Authors: Petr Grinberg, Ankur Kumar, Surya Koppisetti, Gaurav Bharaj

Published: 2025-01-23 18:00:14+00:00

Comment: Accepted to ICASSP 2025

Journal Ref: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2025, pp. 1-5

AI Summary

This paper introduces Gradient Average Transformer Relevancy (GATR), a novel explainable AI (XAI) method for interpreting transformer-based audio deepfake detection (ADD) models in the time domain. GATR is quantitatively shown to outperform existing XAI techniques like Grad-CAM and SHAP-based methods on various faithfulness metrics when evaluating explanations on large datasets. The study highlights that XAI methods differ significantly in their interpretations and that conclusions about detector focus (e.g., speech/non-speech regions, phonetic content) derived from limited utterances may not generalize across entire datasets or different acoustic conditions.

Abstract

Adding explanations to audio deepfake detection (ADD) models will boost their real-world application by providing insight into the decision-making process. In this paper, we propose a relevancy-based explainable AI (XAI) method to analyze the predictions of transformer-based ADD models. We compare against standard Grad-CAM and SHAP-based methods, using quantitative faithfulness metrics as well as a partial spoof test, to comprehensively analyze the relative importance of different temporal regions in an audio signal. We consider large datasets, unlike previous works where only limited utterances are studied, and find that the XAI methods differ in their explanations. The proposed relevancy-based XAI method performs the best overall on a variety of metrics. Further investigation of the relative importance of speech/non-speech, phonetic content, and voice onsets/offsets suggests that the XAI results obtained from analyzing limited utterances don't necessarily hold when evaluated on large datasets.


Key findings
The proposed GATR method outperforms Grad-CAM, DeepSHAP, and GradientSHAP on faithfulness metrics, offering more reliable explanations for ADD model decisions. The study shows that different XAI methods yield conflicting interpretations of what ADD models focus on, and that insights from analyzing a limited number of audio samples do not consistently generalize to larger datasets or different acoustic conditions. Notably, non-speech regions are found to be highly important for bona fide audio, while for spoofed audio, the importance of specific regions such as unstressed vowels or voice onsets varies significantly between datasets.
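To make the idea of region importance concrete, the following minimal sketch compares the share of total absolute relevance that falls inside a region (for example, non-speech) with the share of the utterance's duration that the region occupies. This is an illustrative assumption, not the paper's metric: the function name `region_relevance_share`, the toy relevance values, and the toy mask are all hypothetical, and in practice the mask would come from a voice activity detector or phoneme alignment.

```python
import numpy as np

def region_relevance_share(relevance, region_mask):
    """Return (relevance share, duration share) for a region of an utterance.

    relevance   : per-timestep attribution scores from any XAI method
    region_mask : boolean array, True where the timestep belongs to the region
    """
    relevance = np.abs(np.asarray(relevance, dtype=float))
    region_mask = np.asarray(region_mask, dtype=bool)
    if relevance.shape != region_mask.shape:
        raise ValueError("relevance and region_mask must have the same shape")
    total = relevance.sum()
    if total == 0:
        raise ValueError("relevance is all zeros; shares are undefined")
    relevance_share = relevance[region_mask].sum() / total
    duration_share = region_mask.mean()
    return relevance_share, duration_share

# Toy example (hypothetical numbers): the first two timesteps are non-speech.
relevance = np.array([0.4, 0.4, 0.05, 0.05, 0.05, 0.05])
non_speech = np.array([True, True, False, False, False, False])
rel_share, dur_share = region_relevance_share(relevance, non_speech)
```

A region whose relevance share clearly exceeds its duration share receives more attribution than its length alone would suggest; averaging such comparisons over a whole dataset, rather than a handful of utterances, is the kind of analysis the findings above argue for.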
Approach
The authors propose the Gradient Average Transformer Relevancy (GATR) method, which adapts existing relevancy map techniques for transformer models to ADD models that lack a [CLS] token, such as Wav2Vec2-AASIST. GATR uses gradient-weighted averaging to attribute importance scores to raw audio timesteps. The authors compare GATR against Grad-CAM and SHAP variants using quantitative faithfulness metrics, perturbation tests, and a partial spoof test. They also employ a Relative Contribution Quantification (RCQ) metric for automated, dataset-level hypothesis testing on the importance of different audio regions.
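The perturbation tests mentioned above can be illustrated with a generic sketch: remove the timesteps an explanation ranks as most relevant and measure how much the detector's score drops. A faithful explanation should produce a larger drop than an unfaithful one. The summary does not specify the paper's exact perturbation scheme or metrics, so everything here is an assumption: `perturbation_drop`, `toy_score`, zeroing as the perturbation, and the 50% fraction are hypothetical choices for illustration only.

```python
import numpy as np

def perturbation_drop(score_fn, waveform, relevance, fraction=0.2):
    """Zero out the most relevant samples and return the drop in model score.

    score_fn : callable mapping a 1-D waveform to a scalar score
               (a stand-in for a real detector; hypothetical here)
    fraction : share of samples to perturb, ranked by relevance
    """
    waveform = np.asarray(waveform, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    k = int(round(fraction * waveform.size))
    top = np.argsort(relevance)[::-1][:k]  # indices of the k highest scores
    perturbed = waveform.copy()
    perturbed[top] = 0.0                   # assumed perturbation: silence
    return score_fn(waveform) - score_fn(perturbed)

def toy_score(x):
    """Toy 'detector' that only looks at the energy of the first four samples."""
    return float(np.sum(x[:4] ** 2))

waveform = np.ones(8)
good_relevance = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
bad_relevance = good_relevance[::-1]  # points at the samples the toy model ignores

good_drop = perturbation_drop(toy_score, waveform, good_relevance, fraction=0.5)
bad_drop = perturbation_drop(toy_score, waveform, bad_relevance, fraction=0.5)
```

In this toy setup the explanation that points at the samples the model actually uses removes the entire score, while the reversed explanation leaves the score unchanged. With a real detector, `score_fn` would wrap the model's forward pass, and the drop would typically be averaged over many utterances and several perturbation fractions.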
Datasets
ASVspoof 2019 (ASV19), In-The-Wild (ITW), PartialSpoof
Model(s)
Wav2Vec2-AASIST (a transformer-based audio deepfake detection model)
Author countries
Switzerland, USA