Character-Level Perturbations Disrupt LLM Watermarks

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes the severe fragility of existing large language model (LLM) watermarking schemes under character-level perturbations—including typographical errors, character swaps/deletions, and homoglyph substitutions—which disrupt tokenization and thereby corrupt watermark signals across tokens, drastically degrading detection robustness. To exploit this vulnerability, the authors propose two novel black-box attacks operating under low query budgets: (1) a genetic-algorithm-guided watermark removal strategy, and (2) an adaptive composite character perturbation framework. Experiments demonstrate up to 92% watermark removal success across multiple state-of-the-art watermarking methods—substantially outperforming prior attacks—and show resilience against existing watermark defenses. The study reveals a critical design gap in current LLM watermarking: insufficient consideration of perturbations affecting underlying text representation and tokenization. It provides empirical evidence and concrete directions for developing tokenization-robust watermarking mechanisms.
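The attack-range intuition can be sketched with a toy greedy longest-match tokenizer (the vocabulary below is a made-up illustration, not a real LLM tokenizer): a single character edit inside one long token forces the segmenter to fall back to many short tokens, so one edit perturbs the watermark signal carried by several token positions at once.

```python
# Toy greedy longest-match tokenizer over a hypothetical vocabulary,
# illustrating how one character-level edit re-segments several tokens.

VOCAB = {"water", "mark", "watermark", "ing",
         "w", "a", "t", "e", "r", "m", "k", "i", "n", "g", "0"}

def tokenize(text: str) -> list[str]:
    """Greedy longest-match segmentation over the toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character becomes its own token
            i += 1
    return tokens

print(tokenize("watermarking"))   # ['watermark', 'ing']
print(tokenize("waterma0king"))   # ['water', 'm', 'a', '0', 'k', 'ing']
```

One inserted character shatters a two-token segmentation into six tokens, which is the "attack range" effect the paper attributes to character-level perturbations.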

📝 Abstract
Large Language Model (LLM) watermarking embeds detectable signals into generated text for copyright protection, misuse prevention, and content detection. While prior studies evaluate robustness using watermark removal attacks, these methods are often suboptimal, creating the misconception that effective removal requires large perturbations or powerful adversaries. To bridge this gap, we first formalize the system model for LLM watermarking and characterize two realistic threat models constrained by limited access to the watermark detector. We then analyze how different types of perturbation vary in their attack range, i.e., the number of tokens they can affect with a single edit. We observe that character-level perturbations (e.g., typos, swaps, deletions, homoglyphs) can influence multiple tokens simultaneously by disrupting the tokenization process. We demonstrate that character-level perturbations are significantly more effective for watermark removal under the most restrictive threat model. We further propose guided removal attacks based on the Genetic Algorithm (GA) that use a reference detector for optimization. Under a practical threat model with limited black-box queries to the watermark detector, our method demonstrates strong removal performance. Experiments confirm the superiority of character-level perturbations and the effectiveness of the GA in removing watermarks under realistic constraints. Additionally, we argue that there is an adversarial dilemma when considering potential defenses: any fixed defense can be bypassed by a suitable perturbation strategy. Motivated by this principle, we propose an adaptive compound character-level attack. Experimental results show that this approach effectively defeats these defenses. Our findings highlight significant vulnerabilities in existing LLM watermarking schemes and underline the urgency of developing new robust mechanisms.
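The GA-guided removal idea from the abstract can be sketched as follows. The "green list" detector stand-in (a KGW-style hash split over words), the homoglyph table, and all parameters are illustrative assumptions, not the paper's actual implementation:

```python
import random

# Illustrative homoglyph table: Latin -> visually similar Cyrillic letters.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}

def green_fraction(text: str) -> float:
    """Toy detector stand-in: fraction of words whose hash is 'green'."""
    words = text.split()
    if not words:
        return 0.0
    return sum(hash(w) % 2 == 0 for w in words) / len(words)

def mutate(text: str, rng: random.Random) -> str:
    """Apply one random character-level perturbation: swap, delete, or homoglyph."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    op = rng.choice(["swap", "delete", "homoglyph"])
    if op == "swap":
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]
    if op == "delete":
        return text[:i] + text[i + 1:]
    ch = text[i]
    return text[:i] + HOMOGLYPHS.get(ch, ch) + text[i + 1:]

def ga_remove(text: str, generations: int = 50, pop_size: int = 20, seed: int = 0) -> str:
    """Evolve perturbed variants, keeping those with the lowest detector score."""
    rng = random.Random(seed)
    population = [text]
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng) for _ in range(pop_size)]
        population = sorted(set(population + children), key=green_fraction)[:pop_size]
    return population[0]
```

Because the best candidate so far is always retained when the population is truncated, the surviving variant's detector score can only decrease or stay flat across generations; the paper's GA additionally works under a limited black-box query budget, which this sketch does not model.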
Problem

Research questions and friction points this paper is trying to address.

Character-level perturbations undermine LLM watermark robustness
Analyzing the attack range of perturbations on the tokenization process
Proposing a genetic algorithm for effective watermark removal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Character-level perturbations disrupt the tokenization process
Genetic Algorithm optimizes watermark removal attacks
Adaptive compound attack defeats fixed defenses
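The adversarial dilemma behind the adaptive compound attack can be illustrated in a few lines; the normalization defense and both attacks below are hypothetical stand-ins, not the paper's methods:

```python
# A fixed defense that normalizes known homoglyphs is bypassed simply by
# switching to a different perturbation type (here, an adjacent swap).

NORMALIZE = {"\u0430": "a", "\u0435": "e", "\u043e": "o"}  # undo Cyrillic homoglyphs

def defend(text: str) -> str:
    """Fixed defense: map known homoglyphs back before detection."""
    return "".join(NORMALIZE.get(c, c) for c in text)

def homoglyph_attack(text: str) -> str:
    """Replace Latin 'a' with a Cyrillic look-alike."""
    return text.replace("a", "\u0430")

def swap_attack(text: str) -> str:
    """Swap two adjacent characters in the first word longer than 3 chars."""
    words = text.split()
    for k, w in enumerate(words):
        if len(w) > 3:
            words[k] = w[0] + w[2] + w[1] + w[3:]
            break
    return " ".join(words)

original = "watermark detection"
print(defend(homoglyph_attack(original)) == original)  # True: defense undoes it
print(defend(swap_attack(original)) == original)       # False: defense misses swaps
```

An adaptive attacker who probes which perturbation survives the defense can always pick one the fixed defense does not cover, which is the dilemma the paper formalizes.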