mAVE: A Watermark for Joint Audio-Visual Generation Models

Authors: Luyang Si, Leyi Pan, Lijie Wen

Published: 2026-03-07 07:59:31+00:00

AI Summary

This paper introduces mAVE (Manifold Audio-Visual Entanglement), a novel watermarking framework designed for joint audio-visual generation models that addresses the 'Binding Vulnerability' of existing methods. mAVE cryptographically binds audio and video latents at initialization, creating a 'Legitimate Entanglement Manifold' that protects vendor copyright and ensures content provenance. The method is performance-lossless and provides an exponential security bound against Swap Attacks, achieving over 99% binding integrity.

Abstract

As Joint Audio-Visual Generation Models see widespread commercial deployment, embedding watermarks has become essential for protecting vendor copyright and ensuring content provenance. However, existing techniques suffer from an architectural mismatch by treating modalities as decoupled entities, exposing a critical Binding Vulnerability. Adversaries exploit this via Swap Attacks by replacing authentic audio with malicious deepfakes while retaining the watermarked video. Because current detectors rely on independent verification ($Video_{wm} \vee Audio_{wm}$), they incorrectly authenticate the manipulated content, falsely attributing harmful media to the original vendor and severely damaging their reputation. To address this, we propose mAVE (Manifold Audio-Visual Entanglement), the first watermarking framework natively designed for joint architectures. mAVE cryptographically binds audio and video latents at initialization without fine-tuning, defining a Legitimate Entanglement Manifold via Inverse Transform Sampling. Experiments on state-of-the-art models (LTX-2, MOVA) demonstrate that mAVE guarantees performance-losslessness and provides an exponential security bound against Swap Attacks. Achieving near-perfect binding integrity ($>99\%$), mAVE offers a robust cryptographic defense for vendor copyright.


Key findings
mAVE achieves 99.9% accuracy in defending against Swap Attacks, significantly outperforming unimodal watermarking baselines (50% for the weak baseline and 86.2% for the strong baseline with SyncNet). The framework guarantees performance-losslessness, maintaining generation quality statistically indistinguishable from unwatermarked content. It also demonstrates strong robustness against various video and audio attacks while achieving near-perfect binding integrity (>99%).
Approach
mAVE intervenes at the initialization stage of joint audio-visual generation models, cryptographically binding initial audio and video noise latents. It constructs a 'Legitimate Entanglement Manifold' via Inverse Transform Sampling, functionally linking audio noise to a cryptographic hash of video noise. This process ensures both modalities originate from the same session, preventing swap attacks without requiring model fine-tuning.
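The binding described above can be sketched in code. This is a minimal illustration, not the paper's implementation: the function names (`derive_audio_noise`, `verify_binding`), the use of SHA-256, the vendor `session_key`, and the serialization scheme are all assumptions introduced for clarity. The core idea follows the text: hash the video noise latent to seed a PRNG, then produce the audio noise latent via inverse transform sampling (uniform draws pushed through the inverse Gaussian CDF), so a detector can later re-derive the expected audio noise and reject any swapped audio.

```python
import hashlib
import random
from statistics import NormalDist

def derive_audio_noise(video_noise, n_audio, session_key=b"vendor-secret"):
    """Hypothetical sketch of mAVE-style latent binding.

    Hash the video noise latent (plus a vendor key) to seed a PRNG,
    then map uniform draws through the inverse Gaussian CDF -- i.e.,
    inverse transform sampling of a standard normal.
    """
    # Serialize the video latent deterministically before hashing.
    payload = session_key + b"|" + ",".join(f"{x:.8f}" for x in video_noise).encode()
    seed = int.from_bytes(hashlib.sha256(payload).digest(), "big")
    rng = random.Random(seed)
    inv_cdf = NormalDist().inv_cdf  # Phi^{-1}: Uniform(0,1) -> N(0,1)
    return [inv_cdf(rng.random()) for _ in range(n_audio)]

def verify_binding(video_noise, audio_noise, session_key=b"vendor-secret", tol=1e-9):
    """A detector re-derives the expected audio noise from the video
    noise and accepts only if both latents came from the same session."""
    expected = derive_audio_noise(video_noise, len(audio_noise), session_key)
    return all(abs(a - e) < tol for a, e in zip(audio_noise, expected))
```

Under this sketch, a Swap Attack fails by construction: replacing either latent breaks the hash-derived functional link, so `verify_binding` returns `False`, which is the source of the exponential security bound claimed against swapped content.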
Datasets
LTX-2 (generated content for evaluation), MOVA-720p (generated content for evaluation), VBench (text prompts for generation), Stable Diffusion (for I2AV input images).
Model(s)
LTX-2, MOVA-720p (joint audio-visual generation models on which mAVE operates and evaluates detection), VideoShield (baseline video watermarking), AudioSeal (baseline audio watermarking), WavMark (baseline audio watermarking), Timbre (baseline audio watermarking).
Author countries
China