On Evaluating The Performance of Watermarked Machine-Generated Texts Under Adversarial Attacks

Authors: Zesen Liu, Tianshuo Cong, Xinlei He, Qi Li

Published: 2024-07-05 18:09:06+00:00

AI Summary

This paper evaluates the robustness of eight text watermarking schemes for Large Language Models (LLMs) against twelve adversarial attacks. The study finds that current watermarking techniques are vulnerable to various attacks, particularly combined attacks, highlighting the need for more resilient solutions.

Abstract

Large Language Models (LLMs) excel in various applications, including text generation and complex tasks. However, the misuse of LLMs raises concerns about the authenticity and ethical implications of the content they produce, such as deepfake news, academic fraud, and copyright infringement. Watermarking techniques, which embed identifiable markers in machine-generated text, offer a promising solution to these issues by allowing for content verification and origin tracing. Unfortunately, the robustness of current LLM watermarking schemes under potential watermark removal attacks has not been comprehensively explored. In this paper, to fill this gap, we first systematically comb the mainstream watermarking schemes and removal attacks on machine-generated texts, and then we categorize them into pre-text (before text generation) and post-text (after text generation) classes so that we can conduct diversified analyses. In our experiments, we evaluate eight watermarks (five pre-text, three post-text) and twelve attacks (two pre-text, ten post-text) across 87 scenarios. Evaluation results indicate that (1) KGW and Exponential watermarks offer high text quality and watermark retention but remain vulnerable to most attacks; (2) Post-text attacks are found to be more efficient and practical than pre-text attacks; (3) Pre-text watermarks are generally more imperceptible, as they do not alter text fluency, unlike post-text watermarks; (4) Additionally, combined attack methods can significantly increase effectiveness, highlighting the need for more robust watermarking solutions. Our study underscores the vulnerabilities of current techniques and the necessity for developing more resilient schemes.


Key findings

KGW and Exponential watermarks showed high text quality and watermark retention but remained vulnerable to many attacks. Post-text attacks were generally more efficient and practical than pre-text attacks. Combined attack methods significantly increased attack effectiveness, underscoring the need for improved watermarking robustness.
Approach

The researchers systematically categorized watermarking schemes and removal attacks into pre-text (before text generation) and post-text (after text generation) classes. They evaluated eight watermarking schemes and twelve attacks across 87 scenarios using metrics such as quality score, watermark rate, and robustness score.
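The KGW watermark named in the key findings works by hashing the previous token to partition the vocabulary into a "green" and a "red" list, then biasing generation toward green tokens; detection counts green hits and computes a z-score against the unwatermarked expectation. The sketch below illustrates only the detection statistic, assuming a SHA-256 hash of the previous token seeds the partition; the `GAMMA` value, helper names, and toy vocabulary are illustrative, not the paper's implementation.

```python
import hashlib
import math
import random

GAMMA = 0.25  # assumed fraction of the vocabulary marked "green" per step


def green_list(prev_token: str, vocab: list[str], gamma: float = GAMMA) -> set[str]:
    """Seed a PRNG with the previous token and mark a gamma-fraction
    of the vocabulary as 'green' (illustrative stand-in for KGW's hash)."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    k = max(1, int(gamma * len(vocab)))
    return set(rng.sample(vocab, k))


def kgw_z_score(tokens: list[str], vocab: list[str], gamma: float = GAMMA) -> float:
    """Detection statistic: z = (hits - gamma*T) / sqrt(T*gamma*(1-gamma)),
    where hits counts tokens falling in their step's green list."""
    scored = tokens[1:]  # each token is scored against its predecessor's list
    hits = sum(
        1 for prev, tok in zip(tokens, scored)
        if tok in green_list(prev, vocab, gamma)
    )
    t = len(scored)
    return (hits - gamma * t) / math.sqrt(t * gamma * (1 - gamma))
```

A high z-score indicates far more green tokens than chance would produce, flagging the text as watermarked; removal attacks succeed by pushing this statistic back below the detection threshold.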
Datasets

Watermark generation dataset (DGen) containing 296 instructions for three long-text generation tasks (book report, story generation, and fake news), and OpenWebText for the distillation attack.
Model(s)

Llama-2-7B-chat (target model), Llama-3-8B-instruct (for quality evaluation and imperceptibility testing), GPT-3.5 (mentioned in related work for synonym attacks), and various other models from the Llama family (for paraphrase attacks).
Author countries

China, Hong Kong