MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks struggle to evaluate how well multimodal large language models (MLLMs) sustain exploration in dynamic, open-world environments. To address this gap, this work proposes MineExplorer, a Minecraft-based benchmark that filters out atomic tasks relying heavily on Minecraft-specific knowledge and composes the remaining ones into implicit multi-hop exploration tasks. A multi-agent synthesis workflow jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. The benchmark is organized around a ReAct-style capability formulation, and human evaluation shows that the multi-agent workflow produces significantly more reliable instances than a single-agent baseline. Experiments show that advanced MLLMs handle many single-hop tasks but degrade sharply on multi-hop tasks in which hidden prerequisites must be coordinated over longer trajectories. Moreover, neither larger models nor thinking modes consistently improve performance.

📝 Abstract

Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce MineExplorer, a benchmark for evaluating the open-world exploration capabilities of MLLM agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge, so that the benchmark better reflects general open-world reasoning. We then organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Human evaluation shows that the multi-agent synthesis workflow produces significantly more reliable instances than a single-agent baseline. Experiments with advanced MLLM agents show that open-world exploration remains challenging: strong models can handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that task difficulty tracks agent completion, and that larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at https://github.com/Jometeorie/MineExplorer.
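
To make the idea of an implicit multi-hop task with a rule-based milestone evaluator more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the MineExplorer implementation: the names `AtomicTask`, `reached_milestones`, and `hop_count`, the dictionary-based world state, and the key/door/goal scenario are all hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical world snapshot: counter name -> value (an assumption for this sketch).
State = Dict[str, int]


@dataclass
class AtomicTask:
    name: str
    check: Callable[[State], bool]  # rule-based milestone test on a snapshot
    requires: List[str] = field(default_factory=list)  # prerequisites not told to the agent


def reached_milestones(tasks: Dict[str, AtomicTask], trajectory: List[State]) -> List[str]:
    """Return milestones in the order they are reached along a trajectory.

    A milestone only counts once all of its prerequisites have been reached.
    """
    reached: List[str] = []
    for state in trajectory:
        progressed = True
        while progressed:  # one snapshot may unlock several milestones in a row
            progressed = False
            for name, task in tasks.items():
                if name in reached:
                    continue
                if all(r in reached for r in task.requires) and task.check(state):
                    reached.append(name)
                    progressed = True
    return reached


def hop_count(tasks: Dict[str, AtomicTask], name: str) -> int:
    """Length of the longest prerequisite chain ending at `name` (1 = single-hop)."""
    return 1 + max((hop_count(tasks, r) for r in tasks[name].requires), default=0)


# Hypothetical three-hop task graph: find_key -> open_door -> reach_goal.
tasks = {
    "find_key": AtomicTask("find_key", lambda s: s.get("key", 0) >= 1),
    "open_door": AtomicTask("open_door", lambda s: s.get("door_open", 0) >= 1, ["find_key"]),
    "reach_goal": AtomicTask("reach_goal", lambda s: s.get("at_goal", 0) >= 1, ["open_door"]),
}

# A trajectory that stalls after the second milestone.
trajectory = [{}, {"key": 1}, {"key": 1, "door_open": 1}]
reached = reached_milestones(tasks, trajectory)
progress = len(reached) / len(tasks)  # fraction of milestones completed
```

In this sketch the agent would only be given the final goal; the prerequisite edges live in the task graph and are used by the evaluator alone, which is one way to read "implicit" multi-hop tasks. Scoring by milestones rather than final success gives partial credit to trajectories that stall partway through.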

Problem

Research questions and friction points this paper is trying to address.

open-world exploration

multimodal large language models

embodied AI

Minecraft benchmark

task generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

open-world exploration

multimodal large language models

task synthesis