Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language agents remain constrained by their reliance on human-annotated supervision, while purely text-based self-evaluation suffers from hallucination and fails to reliably verify multi-step visual reasoning. Method: The paper proposes a tool-augmented self-evolving reasoning framework that establishes a closed loop of "reasoning–self-evaluation–self-repair," enabling continuous autonomous optimization without external rewards or human annotations. Crucially, tool invocation is integrated deeply into the reasoning chain to generate structured self-feedback, and fine-grained self-reward signals drive reinforcement learning. Contribution/Results: Evaluated on geometric problem solving and visual scientific analysis, the model achieves a 12.5% improvement over the base model. This work provides the first empirical validation of stable, self-sustained evolution in vision-language agents under zero external supervision.

📝 Abstract
Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement through tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, in which tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0/Agent0-VL.
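The Solver/Verifier cycle described in the abstract can be illustrated with a minimal sketch. All function names here (`solve`, `verify`, `tool_check`, `reasoning_cycle`) are hypothetical stand-ins, not the paper's API: in Agent0-VL both roles are played by a single LVLM and the self-rewards drive reinforcement learning, whereas this toy version only shows the control flow of "reasoning–self-evaluation–self-repair."

```python
def tool_check(candidate, problem):
    # Hypothetical tool-grounded check (e.g. a geometry solver or calculator)
    # standing in for Agent0-VL's tool-integrated verification.
    return candidate == problem["answer"]

def solve(problem, hint=None):
    # Stand-in Solver: produces a candidate answer; an optional hint
    # models the self-repair step informed by Verifier feedback.
    return hint if hint is not None else problem["first_try"]

def verify(candidate, problem):
    # Stand-in Verifier: returns a fine-grained self-reward plus
    # structured, tool-derived feedback when the candidate fails.
    ok = tool_check(candidate, problem)
    reward = 1.0 if ok else 0.0
    feedback = None if ok else problem["answer"]
    return reward, feedback

def reasoning_cycle(problem, max_rounds=3):
    """Loop solve -> verify -> repair until the self-reward is maximal."""
    candidate = solve(problem)
    reward = 0.0
    for _ in range(max_rounds):
        reward, feedback = verify(candidate, problem)
        if reward >= 1.0:
            break
        candidate = solve(problem, hint=feedback)  # self-repair
    return candidate, reward
```

In the paper, the reward is not binary but fine-grained, and the loop additionally updates the model's weights via RL so that the reasoning and evaluation distributions align over time; the sketch freezes the model and only iterates at inference.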
Problem

Research questions and friction points this paper is trying to address.

Overcoming limitations of human-annotated supervision in vision-language agents
Addressing evaluation hallucinations in purely text-based self-assessment methods
Enabling continuous self-improvement through tool-integrated reasoning and verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-evolving agent with tool-integrated reasoning
Unified solver and verifier roles in single model
Tool-grounded self-evaluation enabling autonomous improvement