LAVA: Layered Audio-Visual Anti-tampering Watermarking for Robust Deepfake Detection and Localization

Authors: Bokang Zeng, Zheng Gao, Xiaoyu Li, Xiaoyan Feng, Jiaojiao Jiang

Published: 2026-04-27 02:02:59+00:00

Comment: 10 pages, submitted to ACMMM 2026

AI Summary

This paper introduces LAVA (Layered Audio-Visual Anti-tampering Watermarking), a framework for robust deepfake tamper detection and localization in short-form videos. It addresses limitations of existing watermarking methods, which often decouple audio and visual evidence and struggle under multimodal misalignment and compression. LAVA achieves this by leveraging cross-modal watermark fusion and calibration-aware alignment to maintain reliable tamper evidence across various real-world degradations.

Abstract

Proactive watermarking offers a promising approach for deepfake tamper detection and localization in short-form videos. However, existing methods often decouple audio and visual evidence and assume that watermark signals remain reliable under real-world degradations, making tamper localization vulnerable to multimodal misalignment and compression distortions. Moreover, existing semi-fragile visual watermarking methods often degrade significantly under codec compression because their embedding bands overlap with compression-sensitive frequency regions. To address these limitations, we propose Layered Audio-Visual Anti-tampering Watermarking (LAVA), a calibration-aware audio-visual watermark fusion framework for deepfake tamper detection and localization. LAVA leverages cross-modal watermark fusion and calibration-aware alignment to preserve consistent and reliable tamper evidence under compression and audio-visual asynchrony, enabling robust tamper localization. Extensive experiments demonstrate that LAVA achieves near-perfect detection performance (AP = 0.999), remains robust to compression and multimodal misalignment, and significantly improves tamper localization reliability over existing audio-visual fusion baselines.


Key findings
LAVA achieves near-perfect detection performance (AP = 0.999) and robust localization (IoU = 0.955) under various distortions, including compression and multimodal misalignment. Its layered design, particularly the reliability gate, significantly improves localization reliability and calibration quality (ECE ≤ 0.003) over unimodal baselines and naive fusion. The approach also attains high attribution accuracy (99.5–100%) while keeping the embedded watermarks imperceptible and producing zero false positives.
Approach
LAVA embeds independent semi-fragile watermarks into both visual and audio modalities. At inference, it employs a four-layer hierarchical process: temporal stretch correction, a reliability gate to manage global channel failures, confidence-weighted fusion of per-frame scores, and temperature scaling for probability calibration. This layered design ensures robust detection and localization by systematically addressing modality misalignment, channel unreliability, and score miscalibration.
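The last three inference stages can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name, the gate threshold, and the temperature value are assumptions, and the per-frame scores are assumed to already be temporally aligned (i.e., after the first stage's temporal stretch correction).

```python
import math

def lava_fuse(visual_scores, audio_scores, visual_conf, audio_conf,
              gate_threshold=0.2, temperature=1.5):
    """Hypothetical sketch of LAVA's layered fusion (stages 2-4).

    visual_scores / audio_scores: per-frame tamper scores in [0, 1],
    assumed already aligned by the temporal stretch correction stage.
    visual_conf / audio_conf: global per-channel confidences in [0, 1].
    gate_threshold and temperature are illustrative values only.
    """
    # Stage 2: reliability gate -- zero out a channel whose global
    # confidence signals a watermark-channel failure.
    w_v = visual_conf if visual_conf >= gate_threshold else 0.0
    w_a = audio_conf if audio_conf >= gate_threshold else 0.0
    total = w_v + w_a
    if total == 0:
        raise ValueError("both watermark channels unreliable")
    w_v, w_a = w_v / total, w_a / total

    probs = []
    eps = 1e-6
    for v, a in zip(visual_scores, audio_scores):
        # Stage 3: confidence-weighted fusion of per-frame scores.
        fused = w_v * v + w_a * a
        # Stage 4: temperature scaling of the logit for calibrated
        # tamper probabilities (T > 1 softens overconfident scores).
        logit = math.log(fused + eps) - math.log(1.0 - fused + eps)
        probs.append(1.0 / (1.0 + math.exp(-logit / temperature)))
    return probs
```

Note that once a channel is gated out, its per-frame scores have no effect on the fused output, which is how a global channel failure (e.g., a fully destroyed audio watermark) avoids corrupting localization.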
Datasets
LAV-DF, FakeAVCeleb, VoxCeleb2
Model(s)
Semi-fragile visual and audio watermarking detectors. LAVA itself is a fusion framework that operates on detector outputs and does not prescribe the detectors' internal architectures. Baselines include WAM (visual-only) and AudioSeal (audio-only) for watermark-based detection.
Author countries
Australia