Multi-Granularity Adaptive Time-Frequency Attention Framework for Audio Deepfake Detection under Real-World Communication Degradations

Authors: Haohan Shi, Xiyu Shi, Safak Dogan, Tianjin Huang, Yunxiao Zhang

Published: 2025-08-02 19:11:41+00:00

AI Summary

This paper introduces a multi-granularity adaptive time-frequency attention framework for robust audio deepfake detection under real-world communication degradations. The framework uses customizable multi-scale attention heads to capture both global and local time-frequency features, and an adaptive fusion mechanism that reweights these attention branches according to the characteristics of the degradation, improving detection accuracy under speech codec compression and packet loss.

Abstract

The rise of highly convincing synthetic speech poses a growing threat to audio communications. Although existing Audio Deepfake Detection (ADD) methods have demonstrated good performance under clean conditions, their effectiveness drops significantly under degradations such as packet losses and speech codec compression in real-world communication environments. In this work, we propose the first unified framework for robust ADD under such degradations, which is designed to effectively accommodate multiple types of Time-Frequency (TF) representations. The core of our framework is a novel Multi-Granularity Adaptive Attention (MGAA) architecture, which employs a set of customizable multi-scale attention heads to capture both global and local receptive fields across varying TF granularities. A novel adaptive fusion mechanism subsequently adjusts and fuses these attention branches based on the saliency of TF regions, allowing the model to dynamically reallocate its focus according to the characteristics of the degradation. This enables the effective localization and amplification of subtle forgery traces. Extensive experiments demonstrate that the proposed framework consistently outperforms state-of-the-art baselines across various real-world communication degradation scenarios, including six speech codecs and five levels of packet losses. In addition, comparative analysis reveals that the MGAA-enhanced features significantly improve separability between real and fake audio classes and sharpen decision boundaries. These results highlight the robustness and practical deployment potential of our framework in real-world communication environments.


Key findings
The proposed framework consistently outperforms state-of-the-art baselines across the evaluated real-world communication degradation scenarios (six speech codecs and five packet loss levels). The MGAA-enhanced features significantly improve separability between real and fake audio classes and sharpen decision boundaries. The framework also demonstrates high efficiency and practicality for real-world deployment.
Approach
The authors propose a unified framework built around a Multi-Granularity Adaptive Attention (MGAA) architecture that accommodates multiple types of time-frequency (TF) representations. MGAA employs a set of customizable multi-scale attention heads with global and local receptive fields to capture features at different TF granularities, and an adaptive fusion mechanism dynamically reweights these attention branches based on the saliency of TF regions under the input degradation.
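The paper's exact layer definitions are not given in this summary, but the following is a minimal PyTorch-style sketch of the idea: a global attention head, local attention heads at several window sizes (granularities), and a saliency-driven fusion that reweights the branches. All class names, window sizes, and the specific gating layers are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalTFAttention(nn.Module):
    """Global head (GTFA-like): channel gate computed from statistics of the whole TF map."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                                  # x: (B, C, F, T)
        ctx = x.mean(dim=(2, 3))                           # global TF context, (B, C)
        return x * self.gate(ctx)[:, :, None, None]        # broadcast channel weights

class LocalTFAttention(nn.Module):
    """Local head (LTFA-like): spatial attention shared within small TF windows."""
    def __init__(self, channels, window=8):
        super().__init__()
        self.window = window
        self.score = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):                                  # x: (B, C, F, T)
        s = self.score(x)                                  # per-position scores, (B, 1, F, T)
        # Pool scores within each local window, then broadcast back so every
        # position in a window shares one local saliency value.
        k = (min(self.window, s.size(2)), min(self.window, s.size(3)))
        pooled = F.avg_pool2d(s, k, ceil_mode=True)
        attn = torch.sigmoid(F.interpolate(pooled, size=s.shape[-2:], mode="nearest"))
        return x * attn

class AdaptiveFusion(nn.Module):
    """AFM-like fusion: softmax weights per branch, derived from each branch's mean activation."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Linear(channels, 1)

    def forward(self, branches):                           # list of K tensors (B, C, F, T)
        stacked = torch.stack(branches, dim=1)             # (B, K, C, F, T)
        descriptor = stacked.mean(dim=(3, 4))              # per-branch channel saliency, (B, K, C)
        weights = torch.softmax(self.gate(descriptor), dim=1)   # (B, K, 1)
        return (stacked * weights[..., None, None]).sum(dim=1)  # fused map, (B, C, F, T)

class MGAA(nn.Module):
    """One global head plus local heads at several window sizes (granularities)."""
    def __init__(self, channels, local_windows=(4, 8, 16)):
        super().__init__()
        self.heads = nn.ModuleList(
            [GlobalTFAttention(channels)]
            + [LocalTFAttention(channels, w) for w in local_windows]
        )
        self.fusion = AdaptiveFusion(channels)

    def forward(self, x):
        return self.fusion([head(x) for head in self.heads])

# Example: refine a batch of TF feature maps
feats = torch.randn(2, 32, 40, 150)
refined = MGAA(channels=32)(feats)                         # same shape, attention-refined
```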
Datasets
The Fake-or-Real (FoR), WaveFake, LJSpeech, MLAAD-EN, M-AILABS, and ASVspoof 2021 Logical Access (ASVLA) datasets were used to create a training dataset (Dcom) with simulated real-world communication degradations (six speech codecs and five packet loss rates). The ADD-C test dataset was used for evaluation.
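The summary does not list the six codecs or the loss rates used to build Dcom, so the sketch below only illustrates the general recipe for simulating such degradations on a waveform: a mu-law (G.711-style) companding round trip as a stand-in codec, plus random zeroing of fixed-length frames to mimic packet loss. The sample rate, frame length, and loss rate are assumed values.

```python
import numpy as np

def mu_law_roundtrip(x, mu=255, bits=8):
    """Compress/expand a waveform (floats in [-1, 1]) through mu-law companding with
    8-bit quantisation, as a rough stand-in for G.711-style narrowband coding."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    levels = 2 ** bits - 1
    quantised = np.round((compressed + 1) / 2 * levels) / levels * 2 - 1
    return np.sign(quantised) * ((1 + mu) ** np.abs(quantised) - 1) / mu

def drop_packets(x, sr=16000, frame_ms=20, loss_rate=0.10, rng=None):
    """Zero out randomly chosen fixed-length frames to mimic packet loss at a given rate."""
    rng = rng or np.random.default_rng()
    frame = int(sr * frame_ms / 1000)
    y = x.copy()
    for start in range(0, len(y), frame):
        if rng.random() < loss_rate:
            y[start:start + frame] = 0.0
    return y

# Example: one degraded training copy of a (dummy) clean utterance
clean = np.random.uniform(-1.0, 1.0, size=16000).astype(np.float32)
degraded = drop_packets(mu_law_roundtrip(clean), loss_rate=0.10)
```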
Model(s)
A convolutional neural network (CNN)-based framework with a Multi-Granularity Adaptive Attention (MGAA) module. The MGAA module consists of Global Time-Frequency Attention (GTFA), Local Time-Frequency Attention (LTFA), and an Adaptive Fusion Module (AFM).
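As a rough illustration of how the pieces compose, here is a hypothetical way the MGAA block (as sketched under Approach above) could sit inside a small CNN classifier operating on a log-mel spectrogram; the stem, channel width, and two-class head are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MGAADetector(nn.Module):
    """Toy CNN detector: conv stem -> MGAA attention (as sketched above) -> pooled 2-class head."""
    def __init__(self, channels=32):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.mgaa = MGAA(channels)                        # GTFA/LTFA/AFM-style attention block
        self.head = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, 2),                       # real vs. fake logits
        )

    def forward(self, spec):                              # spec: (B, 1, n_mels, T)
        return self.head(self.mgaa(self.stem(spec)))

# Usage on a dummy batch of log-mel spectrograms (80 mel bins, 300 frames)
logits = MGAADetector()(torch.randn(4, 1, 80, 300))      # -> (4, 2)
```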
Author countries
United Kingdom (all authors)