CrafText Benchmark: Advancing Instruction Following in Complex Multimodal Open-Ended World

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing instruction-following benchmarks predominantly operate in static environments with simplistic commands, and so fail to assess agent capabilities in dynamic, multimodal open-world settings. To address this gap, the paper introduces CrafText, a benchmark designed for such scenarios, comprising four task categories (Localization, Conditional, Building, and Achievement), 3,924 complex instructions, and 3,423 unique words. Methodologically, CrafText departs from static paradigms through dynamically evolving task configurations and an evaluation protocol that tests generalization to novel instruction formulations. It combines a multimodal interactive environment, explicit task-structure modeling, and adaptive evaluation metrics to jointly quantify language understanding and adaptive decision-making, providing a standardized, reproducible framework for assessing instruction following under realistic complexity.

📝 Abstract
Following instructions in real-world conditions requires the ability to adapt to the world's volatility and entanglement: the environment is dynamic and unpredictable, instructions can be linguistically complex with diverse vocabulary, and the number of possible goals an agent may encounter is vast. Despite extensive research in this area, most studies are conducted in static environments with simple instructions and a limited vocabulary, making it difficult to assess agent performance in more diverse and challenging settings. To address this gap, we introduce CrafText, a benchmark for evaluating instruction following in a multimodal environment with diverse instructions and dynamic interactions. CrafText includes 3,924 instructions with 3,423 unique words, covering Localization, Conditional, Building, and Achievement tasks. Additionally, we propose an evaluation protocol that measures an agent's ability to generalize to novel instruction formulations and dynamically evolving task configurations, providing a rigorous test of both linguistic understanding and adaptive decision-making.
Problem

Research questions and friction points this paper is trying to address.

Assessing agent performance in dynamic, unpredictable environments
Evaluating instruction following with diverse, complex linguistic inputs
Measuring generalization to novel instructions and evolving tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CrafText benchmark for diverse instruction evaluation
Includes 3,924 instructions with 3,423 unique words
Proposes protocol for generalization and dynamic adaptation
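The generalization protocol described above can be illustrated with a minimal sketch. The helper names (`split_by_paraphrase`, `success_rate`) and the grouping of instructions by a shared `goal_id` are hypothetical simplifications, not the paper's actual API: the idea is that paraphrases of the same underlying task are held out of training, so the agent is scored on instruction wordings it has never seen.

```python
import random

def split_by_paraphrase(instructions, holdout_frac=0.25, seed=0):
    """Hypothetical paraphrase-holdout split.

    instructions: list of (goal_id, text) pairs, where pairs sharing a
    goal_id are different wordings of the same task.
    Returns (train, test); each goal contributes at least one held-out
    wording, so test instructions are novel formulations of seen goals.
    """
    by_goal = {}
    for goal_id, text in instructions:
        by_goal.setdefault(goal_id, []).append(text)
    rng = random.Random(seed)
    train, test = [], []
    for goal_id, texts in by_goal.items():
        texts = texts[:]
        rng.shuffle(texts)
        k = max(1, int(len(texts) * holdout_frac))  # hold out >= 1 paraphrase
        test.extend((goal_id, t) for t in texts[:k])
        train.extend((goal_id, t) for t in texts[k:])
    return train, test

def success_rate(agent, episodes):
    """Fraction of episodes the agent completes; agent(goal, instr) -> bool."""
    if not episodes:
        return 0.0
    return sum(bool(agent(g, i)) for g, i in episodes) / len(episodes)
```

A usage sketch: split the instruction set once, train only on `train`, then report `success_rate(agent, test)` as the instruction-generalization score, alongside in-distribution success on `train`.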