Localizing Audio-Visual Deepfakes via Hierarchical Boundary Modeling

Authors: Xuanjun Chen, Shih-Peng Cheng, Jiawei Du, Lin Zhang, Xiaoxiao Miao, Chung-Che Wang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

Published: 2025-08-04 02:41:09+00:00

AI Summary

The paper introduces HBMNet, a hierarchical boundary modeling network for audio-visual deepfake localization. HBMNet improves localization by integrating audio-visual features, multi-scale temporal cues, and bidirectional boundary-content relationships, outperforming existing methods.

Abstract

Audio-visual temporal deepfake localization under content-driven partial manipulation remains a highly challenging task. In this scenario, the manipulated regions usually span only a few frames, while the majority of the video remains identical to the original. To tackle this, we propose the Hierarchical Boundary Modeling Network (HBMNet), which includes three modules: an Audio-Visual Feature Encoder that extracts discriminative frame-level representations, a Coarse Proposal Generator that predicts candidate boundary regions, and a Fine-grained Probabilities Generator that refines these proposals using bidirectional boundary-content probabilities. From the modality perspective, we enhance audio-visual learning through dedicated encoding and fusion, reinforced by frame-level supervision to boost discriminability. From the temporal perspective, HBMNet integrates multi-scale cues and bidirectional boundary-content relationships. Experiments show that encoding and fusion primarily improve precision, while frame-level supervision boosts recall. Each component (audio-visual fusion, temporal scales, bidirectionality) contributes complementary benefits, collectively enhancing localization performance. HBMNet outperforms BA-TFD and UMMAFormer and shows improved scalability with additional training data.


Key findings
HBMNet significantly outperforms BA-TFD and UMMAFormer on the AV-Deepfake-1M dataset in terms of both precision and recall. The improvements are attributed to effective audio-visual fusion, multi-scale boundary modeling, and bidirectional processing. Performance scales well with increased training data.
Approach
HBMNet uses an Audio-Visual Feature Encoder to extract frame-level representations, a Coarse Proposal Generator to predict candidate boundary regions, and a Fine-grained Probabilities Generator to refine these proposals using bidirectional boundary-content probabilities. It leverages multi-scale temporal cues and frame-level supervision; a minimal sketch of the pipeline follows.
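The sketch below is a minimal PyTorch rendering of this coarse-to-fine pipeline, not the authors' implementation: the encoder internals, the boundary-matching map construction, the GRU-based bidirectional refinement (the paper uses a Nested U-Net for this role), and all tensor shapes and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the HBMNet coarse-to-fine pipeline (illustrative; the
# real module internals, shapes, and hyperparameters are assumptions).
import torch
import torch.nn as nn


class AudioVisualFeatureEncoder(nn.Module):
    """Encodes audio and video into fused frame-level features (stub)."""

    def __init__(self, audio_dim=64, video_dim=128, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.video_proj = nn.Linear(video_dim, hidden)
        self.fuse = nn.Linear(2 * hidden, hidden)
        # Frame-level real/fake head: stands in for the auxiliary
        # frame-level supervision the paper reports boosts recall.
        self.frame_head = nn.Linear(hidden, 1)

    def forward(self, audio, video):
        # audio: (B, T, audio_dim), video: (B, T, video_dim)
        f = torch.cat([self.audio_proj(audio), self.video_proj(video)], dim=-1)
        feats = self.fuse(f)                      # (B, T, hidden)
        frame_logits = self.frame_head(feats)     # (B, T, 1)
        return feats, frame_logits


class CoarseProposalGenerator(nn.Module):
    """Scores candidate (start, duration) regions, boundary-matching style."""

    def __init__(self, hidden=256, max_dur=32):
        super().__init__()
        self.max_dur = max_dur
        self.score = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, feats):
        B, T, H = feats.shape
        # Build a (duration, start) confidence map from mean-pooled features.
        bm = feats.new_zeros(B, H, self.max_dur, T)
        for d in range(1, self.max_dur + 1):
            for s in range(T - d + 1):
                bm[:, :, d - 1, s] = feats[:, s:s + d].mean(dim=1)
        return self.score(bm).squeeze(1)          # (B, max_dur, T)


class FineGrainedProbabilitiesGenerator(nn.Module):
    """Refines proposals with bidirectional boundary/content probabilities."""

    def __init__(self, hidden=256):
        super().__init__()
        # Forward and backward passes over time; a simplification of the
        # paper's Nested U-Net refinement.
        self.fwd = nn.GRU(hidden, hidden, batch_first=True)
        self.bwd = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, 3)      # start / end / content

    def forward(self, feats):
        h_f, _ = self.fwd(feats)
        h_b, _ = self.bwd(feats.flip(1))
        h = torch.cat([h_f, h_b.flip(1)], dim=-1)
        return self.head(h).sigmoid()             # (B, T, 3)


if __name__ == "__main__":
    B, T = 2, 100
    audio, video = torch.randn(B, T, 64), torch.randn(B, T, 128)
    avfe, cpg, fpg = (AudioVisualFeatureEncoder(), CoarseProposalGenerator(),
                      FineGrainedProbabilitiesGenerator())
    feats, frame_logits = avfe(audio, video)
    coarse_map = cpg(feats)   # coarse (duration, start) confidence map
    fine_probs = fpg(feats)   # per-frame start/end/content probabilities
    print(coarse_map.shape, fine_probs.shape)
```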
Datasets
AV-Deepfake-1M (a subset of 8,000 videos for training, 1,000 for validation, and 2,000 for testing)
Model(s)
Hierarchical Boundary Modeling Network (HBMNet), consisting of an Audio-Visual Feature Encoder (AVFE), a Coarse Proposal Generator (CPG), and a Fine-grained Probabilities Generator (FPG). The AVFE uses SENet for the audio stream, and a 3D CNN, ResNet-18, and a temporal convolutional network (TCN) for the video stream. The CPG uses a Boundary-Matching layer; the FPG uses a Nested U-Net.
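For concreteness, here is a minimal sketch of two of the building blocks named above: a squeeze-and-excitation (SE) channel-attention block, the core mechanism of SENet, and a residual dilated temporal convolution (TCN) block. Channel sizes, kernel widths, and wiring are assumptions, not the paper's configuration.

```python
# Sketch of two building blocks named above: a squeeze-and-excitation (SE)
# block (SENet's core idea) and a dilated temporal convolution (TCN) block.
# Sizes and wiring are illustrative assumptions.
import torch
import torch.nn as nn


class SEBlock1d(nn.Module):
    """Squeeze-and-excitation over channels of a (B, C, T) feature map."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):             # x: (B, C, T)
        w = self.fc(x.mean(dim=2))    # squeeze over time -> (B, C)
        return x * w.unsqueeze(-1)    # excite: rescale each channel


class TCNBlock(nn.Module):
    """Residual dilated 1-D conv block for frame-level video features."""

    def __init__(self, channels, dilation=1, k=3):
        super().__init__()
        pad = (k - 1) // 2 * dilation  # keep the temporal length fixed
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, k, padding=pad, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, k, padding=pad, dilation=dilation))

    def forward(self, x):             # x: (B, C, T)
        return torch.relu(x + self.net(x))


audio = torch.randn(2, 64, 100)       # (batch, channels, frames)
video = torch.randn(2, 128, 100)
print(SEBlock1d(64)(audio).shape)     # torch.Size([2, 64, 100])
print(TCNBlock(128, dilation=2)(video).shape)
```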
Author countries
Taiwan, USA