Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a critical vulnerability in current large language model (LLM) watermarking schemes: when users can access several models, the watermark perturbations that providers apply independently cancel out once the models' output distributions are averaged, which undermines detection reliability and source attribution. The paper introduces WASH (Watermark Attenuation via Statistical Hybridisation), a method that linearly ensembles the output distributions of multiple models while handling vocabulary misalignment and tokenisation differences across heterogeneous models. Averaging across just three models reduces watermark detection z-scores from 5–300 to below 2, pushes the true positive rate at a 5% false positive rate below 50%, improves text quality by 27.5%, and runs 6 times faster than the best baseline on long-sequence generation.
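To make the z-score figures concrete, here is a minimal sketch of how a detection z-score is commonly computed for a green-list style watermark. This is an assumption for illustration only: the page does not say which six schemes were tested, and the function name `green_z_score` and the parameter `gamma` (the expected fraction of green tokens in unwatermarked text) are hypothetical.

```python
import math


def green_z_score(green_count: int, num_tokens: int, gamma: float = 0.5) -> float:
    """One-sided z-score for the number of 'green' tokens in a text.

    Under the null hypothesis (no watermark), each token is green with
    probability gamma, so the count is approximately Normal with mean
    gamma * T and variance T * gamma * (1 - gamma).
    """
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


# A strongly watermarked 200-token text: 150 green tokens.
print(round(green_z_score(150, 200), 2))  # 7.07

# A text whose green fraction is close to chance: 105 green tokens.
print(round(green_z_score(105, 200), 2))  # 0.71
```

With a detection threshold such as the value of 4 quoted in the abstract, the first text would be flagged and the second would not, which is the regime the ensemble is reported to push text into.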

📝 Abstract

Watermarking embeds statistical signatures in AI-generated text for detection and attribution. We reveal a fundamental vulnerability: when users access multiple models (today's reality), watermarks trivially fail. Watermarks perturb output distributions away from the original, and in competitive markets, these perturbations are typically independent across providers. We theoretically prove that averaging output probability distributions recovers the unwatermarked distribution up to a second-order error term. Empirically, simply averaging 3-5 models cancels out these perturbations. We introduce WASH (Watermark Attenuation via Statistical Hybridisation), which solves practical challenges in ensemble generation: vocabulary misalignment and tokenisation differences across heterogeneous models. Experiments across six watermarking schemes and three LLMs show that averaging across 3 models suppresses detection z-scores from 5-300 to below 2 (below the detection threshold of 4) and reduces TPR at 5% FPR to below 50%, while improving quality by 27.5% and running 6 times faster than the best baseline on long-sequence generation. Our results suggest that robust AI-text detection via watermarking requires either accepting this fundamental vulnerability or unprecedented coordination among model providers.
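The cancellation argument can be illustrated with a small toy simulation. This is a sketch under strong assumptions, not the WASH implementation: every provider is assumed to share the same base next-token distribution, each applies a green-list logit shift `delta` to an independently chosen random half of the vocabulary, and the vocabularies are already aligned. All function names and parameter values are hypothetical.

```python
import math
import random


def watermark(p, green, delta):
    """Shift the logits of 'green' tokens by delta, then renormalise."""
    weights = [pi * math.exp(delta) if g else pi for pi, g in zip(p, green)]
    total = sum(weights)
    return [w / total for w in weights]


def total_variation(p, q):
    """Total variation distance between two distributions on one vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def linear_ensemble(dists):
    """Uniform average of several distributions over an aligned vocabulary."""
    k = len(dists)
    return [sum(column) / k for column in zip(*dists)]


def demo(vocab_size=1000, num_models=3, delta=2.0, seed=0):
    rng = random.Random(seed)
    # Zipf-like base distribution standing in for the unwatermarked model.
    raw = [1.0 / (i + 1) for i in range(vocab_size)]
    norm = sum(raw)
    p = [r / norm for r in raw]
    # Each provider picks its green list independently of the others.
    watermarked = []
    for _ in range(num_models):
        green = [rng.random() < 0.5 for _ in range(vocab_size)]
        watermarked.append(watermark(p, green, delta))
    single = [total_variation(p, q) for q in watermarked]
    mixed = total_variation(p, linear_ensemble(watermarked))
    return single, mixed


if __name__ == "__main__":
    single, mixed = demo()
    print("distance of each watermarked model from base:", single)
    print("distance of the averaged ensemble from base:", mixed)
```

In this toy setting the averaged distribution sits closer to the base distribution than the watermarked models do on average, because perturbations with independent green lists partly offset one another. Real deployments involve different base models and tokenisers, which is the gap the abstract says WASH addresses.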

Problem

Research questions and friction points this paper is trying to address.

watermarking

large language models

distributional perturbations

ensemble methods

AI-generated text detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

watermarking

linear ensembles

distributional perturbations