🤖 AI Summary
This work exposes a systemic vulnerability in mainstream text watermarking schemes: their reliance on embedding signals in high-entropy tokens. To exploit this flaw, the authors propose SIRA (Self-Information Rewrite Attack), a generic attack framework that requires no white-box access or training and transfers even to lightweight LLMs. SIRA identifies likely pattern tokens by quantifying each token's self-information, then applies zero-shot, prompt-guided, semantics-preserving rewriting combined with dynamic resampling to achieve robust, cross-model watermark removal. Evaluated against seven state-of-the-art watermarking methods, SIRA achieves near-perfect average attack success rates (~100%) at a cost of only $0.88 per million tokens, without access to the watermark key or the target model. Crucially, this work is the first to identify, from an information-theoretic perspective, the inherent risk of the high-entropy embedding paradigm, providing foundational theoretical insight and empirical benchmarks for watermark robustness evaluation and defense design.
📝 Abstract
Text watermarking aims to subtly embed statistical signals into text by controlling the sampling process of a Large Language Model (LLM), enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to preserve text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic, efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which exploits this vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform a targeted attack. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. Experimental results show that SIRA achieves nearly 100% attack success rates against seven recent watermarking methods at a cost of only 0.88 USD per million tokens. Our approach requires no access to the watermarking algorithm or the watermarked LLM and can seamlessly use any LLM as the attack model, even mobile-scale models. Our findings highlight the urgent need for more robust watermarking methods.
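The first stage of the attack, flagging candidate pattern tokens by self-information, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a toy unigram model over a small corpus and an arbitrary threshold (`threshold_bits=4.0`), whereas SIRA would derive token probabilities from an LLM's next-token distribution.

```python
import math
from collections import Counter

def self_information(corpus_tokens):
    """Map each token type to its self-information I(t) = -log2 p(t),
    where p(t) is the token's empirical unigram probability."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: -math.log2(c / total) for tok, c in counts.items()}

def flag_high_info_tokens(text_tokens, corpus_tokens, threshold_bits=4.0):
    """Return the tokens in `text_tokens` whose self-information exceeds
    the threshold -- rare, high-entropy positions like these are where
    watermarking schemes tend to embed their signal."""
    info = self_information(corpus_tokens)
    # Tokens unseen in the corpus are treated as maximally surprising.
    max_info = max(info.values())
    return [t for t in text_tokens if info.get(t, max_info) >= threshold_bits]

# Frequent words ("the", "sat") carry little self-information and are left
# alone; rare or unseen words ("quartz", "near") get flagged for rewriting.
corpus = ("the cat sat on the mat the dog sat on the rug " * 8).split() + ["quartz"]
flagged = flag_high_info_tokens("the cat sat near quartz".split(), corpus)
```

The flagged tokens would then be handed to the rewriting stage, which paraphrases around them while preserving the sentence's meaning.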