What Does an Audio Deepfake Detector Focus on? A Study in the Time Domain

Authors: Petr Grinberg, Ankur Kumar, Surya Koppisetti, Gaurav Bharaj

Published: 2025-01-23 18:00:14+00:00

AI Summary

This paper proposes a relevancy-based explainable AI (XAI) method, Gradient Average Transformer Relevancy (GATR), to analyze predictions of transformer-based audio deepfake detection models. GATR outperforms existing XAI methods (Grad-CAM, SHAP) in faithfulness metrics and a partial spoof test, providing insights into the models' decision-making process on large datasets.

Abstract

Adding explanations to audio deepfake detection (ADD) models will boost their real-world application by providing insight into the decision-making process. In this paper, we propose a relevancy-based explainable AI (XAI) method to analyze the predictions of transformer-based ADD models. We compare against standard Grad-CAM and SHAP-based methods, using quantitative faithfulness metrics as well as a partial spoof test, to comprehensively analyze the relative importance of different temporal regions in an audio signal. We consider large datasets, unlike previous works where only a limited number of utterances are studied, and find that the XAI methods differ in their explanations. The proposed relevancy-based XAI method performs the best overall on a variety of metrics. Further investigation into the relative importance of speech/non-speech, phonetic content, and voice onsets/offsets suggests that the XAI results obtained from analyzing a limited number of utterances don't necessarily hold when evaluated on large datasets.

Key findings

GATR outperforms other XAI methods in various faithfulness metrics. Analysis reveals that non-speech regions are crucial for bona fide audio classification, while unstressed vowels significantly impact spoofed audio detection. However, some findings (e.g., the importance of voice onsets/offsets) do not generalize across datasets.
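
As a rough illustration of how region-level importance of this kind could be quantified, the sketch below computes the share of total relevance that falls inside a labeled region (for example, non-speech frames). It is not the paper's procedure: the function name `relevance_share`, the use of absolute relevance values, and the toy arrays are all assumptions made for illustration.

```python
import numpy as np

def relevance_share(relevance, region_mask):
    """Fraction of total absolute relevance that lies inside `region_mask`.

    Hypothetical helper: `relevance` is one score per time frame and
    `region_mask` is a boolean array of the same length (True = in region).
    """
    rel = np.abs(np.asarray(relevance, dtype=float))
    mask = np.asarray(region_mask, dtype=bool)
    total = rel.sum()
    if total == 0:
        return 0.0  # no relevance anywhere, so no share to report
    return float(rel[mask].sum() / total)

# Toy example (made-up numbers): five frames, two marked as non-speech.
relevance = np.array([0.1, 0.4, 0.3, 0.1, 0.1])
non_speech = np.array([False, True, True, False, False])

share = relevance_share(relevance, non_speech)
baseline = float(non_speech.mean())  # share expected if relevance were uniform
```

Comparing `share` against `baseline` indicates whether a region receives more relevance than its duration alone would predict. Averaging this over many utterances, rather than a handful, is in the spirit of the large-dataset evaluation described above.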

Approach

The authors propose GATR, which adapts relevancy maps so that they can be applied to transformer-based audio deepfake detectors that lack a [CLS] token. They compare GATR against Grad-CAM and SHAP-based methods, using quantitative faithfulness metrics and a novel partial spoof test to evaluate the importance of different temporal regions in the audio.
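
The summary does not say which faithfulness metrics are used. As a minimal sketch of the general idea, the code below implements one common perturbation-style check: zero out the frames an explanation ranks as most relevant and measure how much the model score drops. The detector `toy_score`, the function `deletion_drop`, and all values are hypothetical stand-ins, not the paper's models or metrics.

```python
import numpy as np

def toy_score(x):
    """Hypothetical stand-in for a detector: mean energy of the middle third."""
    n = len(x)
    return float(np.mean(x[n // 3 : 2 * n // 3] ** 2))

def deletion_drop(score_fn, x, relevance, frac=0.2):
    """Zero the top `frac` most relevant frames and return the score drop.

    A larger drop suggests the explanation points at frames the model
    actually relies on (assumed deletion-style faithfulness check).
    """
    k = max(1, int(round(frac * len(x))))
    top = np.argsort(relevance)[::-1][:k]  # indices of the k highest scores
    x_masked = x.copy()
    x_masked[top] = 0.0
    return score_fn(x) - score_fn(x_masked)

rng = np.random.default_rng(0)
x = rng.standard_normal(300)

# An explanation that matches the toy model vs. a random one.
good_rel = np.zeros(300)
good_rel[100:200] = 1.0
rand_rel = rng.random(300)

good_drop = deletion_drop(toy_score, x, good_rel, frac=1 / 3)
rand_drop = deletion_drop(toy_score, x, rand_rel, frac=1 / 3)
```

In this toy setup the explanation that matches the model's true dependence removes the whole score, while the random explanation removes only part of it. Ranking XAI methods by such drops is one way a comparison between relevancy maps, Grad-CAM, and SHAP could be made quantitative.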

Datasets

ASVspoof 2019 (ASV19), In-The-Wild (ITW), PartialSpoof

Model(s)

Wav2Vec2-AASIST

Author countries

Switzerland, USA
