Frame-level Temporal Difference Learning for Partial Deepfake Speech Detection

Authors: Menglu Li, Xiao-Ping Zhang, Lian Zhao

Published: 2025-07-20 19:46:23+00:00

AI Summary

This paper proposes a Temporal Difference Attention Module (TDAM) for partial deepfake speech detection that analyzes frame-level temporal differences without requiring frame-level annotations. TDAM recasts detection as identifying unnatural temporal variations in deepfake speech, achieving state-of-the-art performance on the PartialSpoof and HAD datasets.

Abstract

Detecting partial deepfake speech is essential due to its potential for subtle misinformation. However, existing methods depend on costly frame-level annotations during training, limiting real-world scalability. Moreover, they focus on detecting transition artifacts between bonafide and deepfake segments; as deepfake generation techniques increasingly smooth these transitions, detection has become more challenging. To address this, our work introduces a new perspective by analyzing frame-level temporal differences and reveals that deepfake speech exhibits erratic directional changes and unnatural local transitions compared to bonafide speech. Based on this finding, we propose a Temporal Difference Attention Module (TDAM) that redefines partial deepfake detection as identifying unnatural temporal variations, without relying on explicit boundary annotations. A dual-level hierarchical difference representation captures temporal irregularities at both fine and coarse scales, while adaptive average pooling preserves essential patterns across variable-length inputs to minimize information loss. Our TDAM-AvgPool model achieves state-of-the-art performance, with an EER of 0.59% on the PartialSpoof dataset and 0.03% on the HAD dataset, significantly outperforming existing methods without requiring frame-level supervision.
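The core signal described in the abstract, erratic directional changes in frame-level temporal differences, can be illustrated with a minimal sketch. The function name and the use of cosine similarity between consecutive difference vectors are illustrative assumptions here, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def temporal_difference_stats(frames: torch.Tensor):
    """Illustrative frame-level temporal difference statistics.

    frames: (T, D) tensor of frame embeddings for one utterance.
    Returns the first-order differences and a per-step direction-change
    score (1 - cosine similarity between consecutive difference vectors).
    """
    # First-order temporal differences between adjacent frames: (T-1, D)
    diffs = frames[1:] - frames[:-1]

    # Directional change between consecutive difference vectors: (T-2,)
    cos = F.cosine_similarity(diffs[1:], diffs[:-1], dim=-1)
    direction_change = 1.0 - cos  # high values flag erratic local transitions

    return diffs, direction_change

# Toy usage: per the paper's finding, deepfake speech should show larger,
# more erratic direction_change values than bonafide speech.
frames = torch.randn(200, 1024)  # e.g. 200 frames of 1024-dim embeddings
_, dc = temporal_difference_stats(frames)
print(dc.mean().item())
```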


Key findings
The TDAM-AvgPool model achieves state-of-the-art Equal Error Rates (EERs) of 0.59% on PartialSpoof and 0.03% on HAD. The model effectively detects deepfake artifacts without frame-level supervision and demonstrates strong generalization across different datasets and languages.
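For reference, the Equal Error Rate reported above is the operating point where the false acceptance and false rejection rates coincide. Below is a common way to estimate it from utterance-level scores; this is a generic sketch, not the authors' evaluation code:

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """Estimate the Equal Error Rate (EER) from binary labels and scores.

    labels: 1 for the positive class (e.g. bonafide), 0 otherwise.
    scores: higher means more likely positive.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # EER is where FPR and FNR cross; take the closest threshold.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Toy usage
labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.3, 0.4, 0.7, 0.2])
print(f"EER = {compute_eer(labels, scores):.2%}")
```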
Approach
The approach leverages pre-trained wav2vec2-XLSR embeddings and analyzes frame-level temporal differences to detect unnatural variations in deepfake speech. A dual-level hierarchical difference representation captures temporal irregularities at fine and coarse scales, while adaptive average pooling handles variable-length inputs.
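The description above maps naturally onto a small PyTorch module. Everything below is an assumption for illustration (layer sizes, the two difference spans used for the fine and coarse scales, the attention formulation, and the pooled length); the paper's actual TDAM-AvgPool architecture may differ:

```python
import torch
import torch.nn as nn

class TDAMAvgPoolSketch(nn.Module):
    """Illustrative dual-level temporal-difference module with adaptive
    average pooling. Hyperparameters and layer choices are assumptions,
    not the paper's exact TDAM-AvgPool design."""

    def __init__(self, dim: int = 1024, pooled_len: int = 32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.pool = nn.AdaptiveAvgPool1d(pooled_len)
        self.classifier = nn.Linear(dim * pooled_len, 2)  # bonafide vs. partial deepfake

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) frame embeddings, e.g. from wav2vec2-XLSR.
        fine = x[:, 1:] - x[:, :-1]        # fine scale: adjacent-frame differences
        coarse = x[:, 2:] - x[:, :-2]      # coarse scale: two-frame-span differences
        diffs = torch.cat([fine[:, 1:], coarse], dim=1)  # trim fine to match, stack scales
        attended, _ = self.attn(diffs, diffs, diffs)     # attend to unnatural variations
        # Adaptive pooling maps any input length T to a fixed pooled_len.
        pooled = self.pool(attended.transpose(1, 2)).flatten(1)  # (B, D * pooled_len)
        return self.classifier(pooled)

# Toy usage: two utterances of 200 frames with 1024-dim embeddings.
model = TDAMAvgPoolSketch()
print(model(torch.randn(2, 200, 1024)).shape)  # torch.Size([2, 2])
```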
Datasets
PartialSpoof (PS) and Half-truth Audio Detection (HAD) datasets
Model(s)
wav2vec2-XLSR (see the embedding-extraction sketch at the end of this summary), Temporal Difference Attention Module (TDAM) with adaptive average pooling (TDAM-AvgPool)
Author countries
Canada, China
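
As referenced in the Model(s) entry above, frame embeddings can be obtained from a pre-trained wav2vec2-XLSR checkpoint. A minimal sketch using HuggingFace transformers follows; the specific checkpoint name is an assumption, as the paper may use a different XLSR variant:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint; the paper may use a different wav2vec2-XLSR variant.
CKPT = "facebook/wav2vec2-xls-r-300m"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
model = Wav2Vec2Model.from_pretrained(CKPT).eval()

# One second of dummy 16 kHz audio standing in for a real waveform.
waveform = torch.zeros(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    frames = model(**inputs).last_hidden_state  # (1, T, 1024) frame embeddings

print(frames.shape)
```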