VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems

📅 2025-06-29

🤖 AI Summary

This study addresses the limited spatial understanding and weak task generalization of open-ended learning systems. It proposes a multimodal framework that adds visual feedback to the Voyager agent so that it can build structures autonomously in Minecraft, something the text-only Voyager could not do. The method uses a multimodal large language model of the GPT-4o class, which receives screenshots of the environment as visual feedback. The core contribution is moving beyond a purely text-driven paradigm, which extends the range of spatial tasks an open-ended learning system can attempt. In experiments, VoyagerVision built an average of 2.75 unique structures within 50 iterations and succeeded in half of its attempts on building unit tests in flat worlds, with most failures occurring on more complex structures.

📝 Abstract

Open-endedness is an active field of research in the pursuit of capable Artificial General Intelligence (AGI), allowing models to pursue tasks of their own choosing. Simultaneously, recent advancements in Large Language Models (LLMs) such as GPT-4o [9] have made such models capable of interpreting image inputs. Implementations such as OMNI-EPIC [4] have made use of this capability, providing an LLM with pixel data of an agent's POV so that it can parse the environment and solve tasks. This paper proposes that providing these visual inputs to a model gives it a greater ability to interpret spatial environments and, as such, can increase the number of tasks it can successfully perform, extending its open-ended potential. To this end, this paper proposes VoyagerVision, a multi-modal model capable of creating structures within Minecraft using screenshots as a form of visual feedback, building on the foundation of Voyager. VoyagerVision created an average of 2.75 unique structures within fifty iterations of the system. Because Voyager was incapable of this, VoyagerVision extends it in an entirely new direction. Additionally, in a set of building unit tests, VoyagerVision was successful in half of all attempts in flat worlds, with most failures arising in more complex structures. The project website is available at https://esmyth-dev.github.io/VoyagerVision.github.io/

Problem

Research questions and friction points this paper is trying to address.

Investigates multi-modal information for open-ended learning systems

Enhances spatial interpretation in AI using visual inputs

Extends open-ended potential in Minecraft structure creation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal model with visual feedback

Uses screenshots for spatial interpretation

Extends Voyager with Minecraft structure creation
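
One way to picture the screenshot feedback described above is the loop sketched below. It is a minimal illustration under stated assumptions, not VoyagerVision's actual implementation: every name (`run_building_loop`, `propose_task`, `write_code`, `execute`, `capture_screenshot`, `judge`) is hypothetical, and the language-model and Minecraft calls are left as caller-supplied callables.

```python
from typing import Callable, List


def run_building_loop(
    propose_task: Callable[[List[str]], str],
    write_code: Callable[[str], str],
    execute: Callable[[str], None],
    capture_screenshot: Callable[[], bytes],
    judge: Callable[[str, bytes], bool],
    iterations: int = 50,
) -> List[str]:
    """Hypothetical vision-feedback loop; returns the unique structures judged as built.

    All callables are placeholders (assumptions), e.g. `judge` stands in for a
    multimodal model that compares a task description against a screenshot.
    """
    built: List[str] = []
    for _ in range(iterations):
        # Ask for a new building task, given what has already been built.
        task = propose_task(built)
        # Generate code for the task and run it in the environment.
        code = write_code(task)
        execute(code)
        # Capture a screenshot and use it as visual feedback on the result.
        image = capture_screenshot()
        if judge(task, image) and task not in built:
            built.append(task)
    return built
```

With stub callables, the function can be exercised without a game or a model; counting the returned list over several runs gives a figure comparable in form to the "unique structures within fifty iterations" metric quoted in the abstract.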
