GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis

Authors: Weizhi Liu, Yue Li, Dongdong Lin, Hui Tian, Haizhou Li

Published: 2024-07-15 06:57:19+00:00

AI Summary

This paper introduces GROOT, a generative robust audio watermarking method that embeds the watermark directly into audio as it is synthesized by a diffusion model. This proactive approach surpasses state-of-the-art methods in robustness against individual and compound post-processing attacks while maintaining high watermark extraction accuracy.

Abstract

Amid the burgeoning development of generative models like diffusion models, the task of differentiating synthesized audio from its natural counterpart grows more daunting. Deepfake detection offers a viable solution to combat this challenge. Yet, this defensive measure unintentionally fuels the continued refinement of generative models. Watermarking emerges as a proactive and sustainable tactic, preemptively regulating the creation and dissemination of synthesized content. Thus, this paper, as a pioneer, proposes the generative robust audio watermarking method (Groot), presenting a paradigm for proactively supervising the synthesized audio and its source diffusion models. In this paradigm, the processes of watermark generation and audio synthesis occur simultaneously, facilitated by parameter-fixed diffusion models equipped with a dedicated encoder. The watermark embedded within the audio can subsequently be retrieved by a lightweight decoder. The experimental results highlight Groot's outstanding performance, particularly in terms of robustness, surpassing that of the leading state-of-the-art methods. Beyond its impressive resilience against individual post-processing attacks, Groot exhibits exceptional robustness when facing compound attacks, maintaining an average watermark extraction accuracy of around 95%.


Key findings
GROOT demonstrates superior robustness against both individual and compound audio attacks, maintaining an average watermark extraction accuracy of around 95%. The method preserves high fidelity even at a capacity of 5000 bps, with minimal impact on audio quality. Modified gated convolutional neural networks (MGCNNs) in the decoder significantly improve watermark extraction accuracy over plain CNNs.
Approach
GROOT embeds the watermark during synthesis rather than after it: a dedicated encoder feeds the watermark into a parameter-fixed (frozen) diffusion model, so watermark embedding and audio generation happen in a single pass, and a lightweight decoder later extracts the watermark from the audio. The encoder and decoder are trained with a joint optimization strategy that balances audio quality against watermark extraction accuracy (see the sketch below).
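Because the diffusion model stays frozen, only the watermark encoder and the decoder receive gradients during this joint optimization. The PyTorch sketch below illustrates one way such a training step could look; the module shapes, the `frozen_diffusion` callable (standing in for a pretrained vocoder such as DiffWave with a differentiable reverse process), and the loss weight `lambda_wm` are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WatermarkEncoder(nn.Module):
    """Hypothetical encoder: maps a binary message to a latent tensor
    shaped like the diffusion model's noise input."""
    def __init__(self, msg_bits: int, latent_len: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(msg_bits, 512), nn.ReLU(),
            nn.Linear(512, latent_len),
        )

    def forward(self, msg: torch.Tensor) -> torch.Tensor:
        # msg: (B, msg_bits) float tensor of 0s/1s -> latent: (B, 1, latent_len)
        return self.net(msg).unsqueeze(1)

class WatermarkDecoder(nn.Module):
    """Hypothetical lightweight decoder: recovers message bits from audio."""
    def __init__(self, msg_bits: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(64, msg_bits)

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (B, 1, T) -> bit logits: (B, msg_bits)
        return self.head(self.conv(audio).squeeze(-1))

def joint_step(encoder, decoder, frozen_diffusion, mel, clean_audio, msg,
               lambda_wm: float = 1.0) -> torch.Tensor:
    """One joint-optimization step. `frozen_diffusion(latent, mel)` is assumed
    to run the differentiable reverse process of a pretrained, parameter-fixed
    vocoder and return a waveform of shape (B, 1, T)."""
    latent = encoder(msg)                      # watermark-carrying latent
    wm_audio = frozen_diffusion(latent, mel)   # watermarked synthesis
    # Fidelity term: watermarked output should stay close to the reference audio.
    loss_fid = F.l1_loss(wm_audio, clean_audio)
    # Extraction term: the decoder must recover the embedded bits.
    loss_msg = F.binary_cross_entropy_with_logits(decoder(wm_audio), msg)
    return loss_fid + lambda_wm * loss_msg
```

A training loop would backpropagate this loss while keeping the diffusion model's parameters out of the optimizer, e.g. `torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))`.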
Datasets
LJSpeech, LibriTTS, LibriSpeech
Model(s)
Diffusion models (specifically DiffWave, WaveGrad, and PriorGrad are evaluated), UNet-like network for denoising, modified gated convolutional neural networks (MGCNN) in the decoder.
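The exact "modification" in GROOT's MGCNN is specific to the paper; the baseline it builds on is the standard gated convolution, where a sigmoid-gated branch modulates a feature branch. A minimal PyTorch sketch of such a gated 1-D convolution block follows; the layer sizes and the tanh/sigmoid pairing are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """Standard gated 1-D convolution: a sigmoid gate decides how much of the
    feature branch passes through. This is only a baseline sketch, not the
    paper's modified variant."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 5):
        super().__init__()
        pad = kernel_size // 2
        self.feature = nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad)
        self.gate = nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, in_ch, T) -> (B, out_ch, T)
        return torch.tanh(self.feature(x)) * torch.sigmoid(self.gate(x))

# Example: stacking gated convolutions as the front end of a watermark decoder.
decoder_frontend = nn.Sequential(
    GatedConv1d(1, 32),
    GatedConv1d(32, 64),
)
wave = torch.randn(2, 1, 16000)      # a batch of two 1-second waveforms at 16 kHz
features = decoder_frontend(wave)    # shape: (2, 64, 16000)
print(features.shape)
```

Gating of this kind is commonly used to let a network attenuate activations that do not carry the information of interest, which is a plausible motivation for preferring it over plain CNN layers in a watermark decoder.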
Author countries
China, China, China, China, China