Beyond Identity: A Generalizable Approach for Deepfake Audio Detection

Authors: Yasaman Ahmadiadli, Xiao-Ping Zhang, Naimul Khan

Published: 2025-05-10 22:03:07+00:00

AI Summary

This research introduces an identity-independent audio deepfake detection framework that mitigates identity leakage by focusing on forgery-specific artifacts. The approach uses Artifact Detection Modules (ADMs) and novel dynamic artifact generation techniques to improve cross-dataset generalization.

Abstract

Deepfake audio presents a growing threat to digital security due to its potential for social engineering, fraud, and identity misuse. However, existing detection models suffer from poor generalization across datasets due to implicit identity leakage, where models inadvertently learn speaker-specific features instead of manipulation artifacts. To the best of our knowledge, this is the first study to explicitly analyze and address identity leakage in the audio deepfake detection domain. This work proposes an identity-independent audio deepfake detection framework that mitigates identity leakage by encouraging the model to focus on forgery-specific artifacts instead of overfitting to speaker traits. Our approach leverages Artifact Detection Modules (ADMs) to isolate synthetic artifacts in both the time and frequency domains, enhancing cross-dataset generalization. We introduce novel dynamic artifact generation techniques, including frequency domain swaps, time domain manipulations, and background noise augmentation, to enforce learning of dataset-invariant features. Extensive experiments conducted on the ASVspoof2019, ADD 2022, FoR, and In-The-Wild datasets demonstrate that the proposed ADM-enhanced models achieve F1 scores of 0.230 (ADD 2022), 0.604 (FoR), and 0.813 (In-The-Wild), consistently outperforming the baseline. Dynamic Frequency Swap proves to be the most effective strategy across diverse conditions. These findings emphasize the value of artifact-based learning in mitigating implicit identity leakage for more generalizable audio deepfake detection.


Key findings

The ADM-enhanced models significantly outperformed the baseline, achieving improved F1 scores across datasets (e.g., 0.813 on In-The-Wild). Dynamic Frequency Swap proved the most effective artifact generation strategy, demonstrating the value of artifact-based learning for generalizable deepfake detection.

Approach

The authors propose an Artifact Detection Module (ADM) trained to identify synthetic artifacts in audio deepfakes, independent of speaker identity. They introduce novel dynamic artifact generation techniques (frequency and time domain manipulations, background noise) to enhance the model's ability to generalize across datasets.
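The frequency domain swap described above can be illustrated with a minimal sketch. The paper does not specify the exact implementation; the function name, the `band_fraction` parameter, and the random band selection below are assumptions, showing only the general idea of swapping a frequency band between two magnitude spectrograms to create dynamic artifacts:

```python
import numpy as np

def dynamic_frequency_swap(spec_a, spec_b, band_fraction=0.2, rng=None):
    """Illustrative augmentation: swap a randomly chosen frequency band
    between two magnitude spectrograms of shape (freq_bins, frames).

    This is a sketch of the general idea, not the authors' implementation.
    """
    rng = rng or np.random.default_rng()
    n_bins = spec_a.shape[0]
    band = max(1, int(n_bins * band_fraction))        # width of the swapped band
    start = rng.integers(0, n_bins - band + 1)        # random band position
    out_a, out_b = spec_a.copy(), spec_b.copy()
    # Exchange the selected rows (frequency bins) between the two spectrograms.
    out_a[start:start + band] = spec_b[start:start + band]
    out_b[start:start + band] = spec_a[start:start + band]
    return out_a, out_b
```

Training on such swapped spectrograms encourages the detector to flag the resulting spectral discontinuities (forgery-style artifacts) rather than speaker identity.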
Datasets

ASVspoof2019 (LA), ADD 2022, FoR, In-The-Wild

Model(s)

Xception (primarily), with comparisons to EfficientNet-B3, ResNet50, and VGG16.

Author countries

Canada