Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

📅 2025-09-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Vision-language models (VLMs) currently depend on costly human annotation to improve their reasoning. This paper proposes Vision-Zero, an annotation-free self-improvement framework for VLMs. It uses a strategic self-play mechanism modeled on "Who Is the Spy" games, constructing competitive visual games from arbitrary image pairs. Training uses Iterative Self-Play Policy Optimization (Iterative-SPO), which alternates self-play with reinforcement learning with verifiable rewards (RLVR) so that gains continue instead of plateauing. The framework is demonstrated on CLEVR-based synthetic scenes, charts, and real-world images. Vision-Zero is reported to surpass annotation-based methods on reasoning, chart question answering, and vision-centric understanding tasks, reaching state-of-the-art (SOTA) results. Models and code are publicly released.

📝 Abstract
Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision-language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs. To address this challenge, we propose Vision-Zero, a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between self-play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.
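
To make the game setup in the abstract concrete, here is a minimal sketch of how one "Who Is the Spy"-style round could be dealt from an image pair and scored without labels. It is not the authors' implementation. The `Round` class, the function names, the one-spy rule, and the reward values are all assumptions made for illustration; the abstract only says that games are built from image pairs and that rewards are verifiable.

```python
import random
from dataclasses import dataclass


@dataclass
class Round:
    views: list[str]  # image identifier shown to each player (hypothetical)
    spy: int          # index of the spy, known by construction


def make_round(original: str, altered: str, n_players: int,
               rng: random.Random) -> Round:
    """Deal one image per player; exactly one player (the spy) sees the altered image."""
    if n_players < 2:
        raise ValueError("need at least two players")
    spy = rng.randrange(n_players)
    views = [altered if i == spy else original for i in range(n_players)]
    return Round(views=views, spy=spy)


def vote_rewards(round_: Round, votes: list[int]) -> list[float]:
    """Score votes against the known spy index (assumed reward values).

    A non-spy gets +1.0 for voting for the spy and -1.0 otherwise.
    The spy gets the fraction of other players who failed to identify it.
    """
    others = [v for i, v in enumerate(votes) if i != round_.spy]
    rewards = []
    for player, vote in enumerate(votes):
        if player == round_.spy:
            evaded = sum(1 for v in others if v != round_.spy)
            rewards.append(evaded / len(others))
        else:
            rewards.append(1.0 if vote == round_.spy else -1.0)
    return rewards
```

The point of the sketch is that the reward needs no human label: the spy's identity is fixed when the round is dealt, so any vote can be checked automatically.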
Problem

Research questions and friction points this paper is trying to address.

Eliminates dependency on manual datasets for VLM training
Enables self-improvement through competitive visual games
Achieves sustainable performance gains without human annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-improving VLMs via competitive visual games
Generating games from arbitrary image pairs
Iterative self-play policy optimization for sustained gains
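
The alternation behind Iterative-SPO can be illustrated with a toy schedule. This is a sketch under stated assumptions, not the paper's algorithm: the fixed block lengths, the phase names, and the `phase_schedule` and `train` helpers are hypothetical, and the update callbacks stand in for whatever self-play and RLVR updates the real method performs. The abstract states only that training alternates between the two.

```python
from typing import Callable, Iterator


def phase_schedule(total_steps: int, self_play_steps: int,
                   rlvr_steps: int) -> Iterator[str]:
    """Yield the phase for each step, alternating in fixed-length blocks (assumed)."""
    if self_play_steps <= 0 or rlvr_steps <= 0:
        raise ValueError("both block lengths must be positive")
    cycle = self_play_steps + rlvr_steps
    for step in range(total_steps):
        yield "self_play" if step % cycle < self_play_steps else "rlvr"


def train(total_steps: int,
          self_play_update: Callable[[], None],
          rlvr_update: Callable[[], None],
          self_play_steps: int = 2,
          rlvr_steps: int = 1) -> dict[str, int]:
    """Run the placeholder updates in alternating blocks and count calls per phase."""
    counts = {"self_play": 0, "rlvr": 0}
    for phase in phase_schedule(total_steps, self_play_steps, rlvr_steps):
        if phase == "self_play":
            self_play_update()
        else:
            rlvr_update()
        counts[phase] += 1
    return counts
```

A real schedule might switch phases on a performance signal instead of a fixed step count; the fixed blocks here only show the interleaving.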
Authors
Qinsi Wang, Duke University (interests: Efficiency LLM, Model Accelerate)
Bo Liu, National University of Singapore
Tianyi Zhou, University of Maryland
Jing Shi, Adobe Inc.
Yueqian Lin, PhD Student, Duke University
Yiran Chen, Duke University
Hai Helen Li, Duke University
Kun Wan, Applied Scientist, Adobe Inc.
Wentian Zhao, Adobe Inc.