VLA-Mark: A cross modal watermark for large vision-language alignment model

📅 2025-07-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address intellectual property protection for vision-language alignment (VLA) models, this paper proposes VLA-Mark, the first cross-modal watermarking framework tailored for large-scale VLA models. Methodologically, it embeds detectable watermarks via a multi-scale vision-text alignment mechanism without compromising semantic consistency between images and text. It further introduces an entropy-sensitive dynamic modulation strategy that adaptively allocates watermark strength based on generation uncertainty, prioritizing visual grounding fidelity during low-entropy phases. Crucially, VLA-Mark operates entirely at inference time, requiring no model retraining. Experiments show that VLA-Mark preserves generation quality, with 7.4% lower perplexity and 26.6% higher BLEU than conventional methods, while achieving near-perfect detection (98.8% AUC) and 96.1% resilience against paraphrasing and synonym-substitution attacks.
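
The entropy-sensitive modulation is the part most easily pictured in code. Below is a minimal sketch of one way such a mechanism could scale a green-list logit bias by the normalized entropy of the next-token distribution, so that confident (often visually grounded) steps receive a weaker watermark. The function name, the linear scaling rule, and `delta_max` are illustrative assumptions, not the paper's exact formulation.

```python
import math
import torch

def entropy_scaled_bias(logits: torch.Tensor,
                        green_mask: torch.Tensor,
                        delta_max: float = 2.0) -> torch.Tensor:
    """Scale a green-list logit bias by normalized next-token entropy.

    Low-entropy (confident, often visually grounded) steps get a weaker
    watermark; high-entropy steps get a stronger one. `green_mask` is a
    boolean vocab-sized mask of watermark-favored tokens. The linear
    scaling rule and delta_max are illustrative assumptions.
    """
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    max_entropy = math.log(logits.shape[-1])          # entropy of a uniform distribution
    scale = (entropy / max_entropy).clamp(0.0, 1.0)   # 0 = confident, 1 = uncertain
    return logits + delta_max * scale.unsqueeze(-1) * green_mask.float()
```

The design intuition matches the summary: when the model is already certain about the next token (typically a visually grounded word), the bias shrinks toward zero and the watermark stays out of the way.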

📝 Abstract
Vision-language models demand watermarking solutions that protect intellectual property without compromising multimodal coherence. Existing text watermarking methods disrupt visual-textual alignment through biased token selection and static strategies, leaving semantic-critical concepts vulnerable. We propose VLA-Mark, a vision-aligned framework that embeds detectable watermarks while preserving semantic fidelity through cross-modal coordination. Our approach integrates multiscale visual-textual alignment metrics, combining localized patch affinity, global semantic coherence, and contextual attention patterns, to guide watermark injection without model retraining. An entropy-sensitive mechanism dynamically balances watermark strength and semantic preservation, prioritizing visual grounding during low-uncertainty generation phases. Experiments show 7.4% lower PPL and 26.6% higher BLEU than conventional methods, with near-perfect detection (98.8% AUC). The framework demonstrates 96.1% resilience against attacks such as paraphrasing and synonym substitution while maintaining text-visual consistency, establishing new standards for quality-preserving multimodal watermarking.
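
The abstract names three alignment cues: localized patch affinity, global semantic coherence, and contextual attention patterns. One plausible way to fuse them into a single per-token score, assuming candidate-token embeddings already projected into the vision embedding space, is a weighted sum; the max-over-patches pooling and the equal weights below are assumptions for illustration, not the paper's exact metric.

```python
import torch
import torch.nn.functional as F

def alignment_scores(token_emb: torch.Tensor,     # (V, d) candidate tokens in vision space
                     patch_emb: torch.Tensor,     # (P, d) image patch embeddings
                     global_emb: torch.Tensor,    # (d,)  pooled global image embedding
                     attn_weights: torch.Tensor,  # (V,)  cross-attention mass per candidate
                     w_local: float = 1/3,
                     w_global: float = 1/3,
                     w_attn: float = 1/3) -> torch.Tensor:
    """Fuse localized patch affinity, global semantic coherence, and
    contextual attention into one per-token alignment score. Pooling
    and equal weights are illustrative choices."""
    tok = F.normalize(token_emb, dim=-1)
    patches = F.normalize(patch_emb, dim=-1)
    local = (tok @ patches.T).max(dim=-1).values   # best-matching patch per token
    glob = tok @ F.normalize(global_emb, dim=-1)   # cosine vs. pooled image embedding
    attn = attn_weights / attn_weights.sum().clamp_min(1e-12)
    return w_local * local + w_global * glob + w_attn * attn
```

Tokens scoring high here are the semantic-critical ones the paper aims to shield from biased watermark selection.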
Problem

Research questions and friction points this paper is trying to address.

Protecting intellectual property in vision-language models without disrupting multimodal coherence
Overcoming biased token selection and static strategies in existing text watermarking methods
Preserving semantic fidelity and visual-textual alignment during watermark embedding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal coordination preserves semantic fidelity
Multiscale alignment metrics guide watermark injection (see the sketch after this list)
Entropy-sensitive mechanism balances watermark strength against semantic preservation
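
As referenced above, here is a sketch of how per-token alignment scores could steer a seeded green/red vocabulary split so that the most visually grounded tokens are never suppressed by the watermark bias. The protect-the-top-fraction rule, the parameter names, and the KGW-style seeded partition it is grafted onto are all assumptions; the paper's actual partition rule may differ.

```python
import torch

def alignment_guided_green_list(align_scores: torch.Tensor,
                                seed: int,
                                protect_frac: float = 0.1,
                                gamma: float = 0.5) -> torch.Tensor:
    """Build a watermark 'green list' that always includes the tokens
    with the highest visual-alignment scores, so the logit bias never
    steers generation away from semantic-critical vocabulary.
    protect_frac and gamma are illustrative hyperparameters."""
    vocab = align_scores.shape[0]
    n_protect = max(1, int(protect_frac * vocab))

    # Semantic-critical tokens: top fraction by alignment score.
    protected = torch.zeros(vocab, dtype=torch.bool)
    protected[align_scores.topk(n_protect).indices] = True

    # Seeded pseudo-random green/red split of the remaining vocabulary.
    gen = torch.Generator().manual_seed(seed)
    rand = torch.rand(vocab, generator=gen)
    return protected | ((rand < gamma) & ~protected)
```

A detector that knows the seed can rebuild the same list and score text by its green-token rate, which is how detection figures like the 98.8% AUC above are typically computed in this line of work.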