WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

📅 2026-06-03

🤖 AI Summary

Existing benchmarks for vision-language models lack visual diversity and reasoning complexity, which limits how well they can evaluate visual understanding in open-world scenarios. To address this gap, this work proposes WorldBench, a benchmark that combines a multi-domain taxonomy of thousands of visual concepts with manually designed, high-difficulty questions. Images are collected under the guidance of the taxonomy from search engines and existing datasets, and questions are designed through structured trial-and-error so that frontier MLLMs fail to answer them. Quantitative and human evaluations indicate that WorldBench is more visually diverse than prior benchmarks. Among the 15 multimodal large language models evaluated, the best reaches only 64.0% accuracy, revealing significant limitations in current models’ capacity for complex visual reasoning.

📝 Abstract

In real-world applications, models are expected to perform reliably across diverse settings. Yet many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark for evaluating Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial-and-error, we manually design challenging questions that frontier MLLMs fail to answer. In both quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diversity-focused benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform only marginally above chance level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.
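The taxonomy-guided collection step described in the abstract can be pictured with a small sketch. This is not the authors' pipeline: the nested-dictionary layout, the concept names other than the "living things" domain, and the helper names `iter_concepts` and `build_queries` are all assumptions made for illustration. It only shows how walking a concept taxonomy might yield one image-search query per leaf concept, so that every branch of the taxonomy is represented in the collected images.

```python
from typing import Iterator, List, Tuple

# Hypothetical toy taxonomy. Only the "living things" domain is mentioned in
# the abstract; every other name here is an illustrative placeholder.
TAXONOMY = {
    "living things": {
        "animals": ["red panda", "axolotl"],
        "plants": ["baobab tree"],
    },
    "artifacts": {
        "instruments": ["theremin"],
    },
}


def iter_concepts(node, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], str]]:
    """Yield (path, concept) for every leaf concept under `node`."""
    if isinstance(node, dict):
        # Inner node: descend into each child category.
        for name, child in node.items():
            yield from iter_concepts(child, path + (name,))
    else:
        # Leaf list: emit each concept with the category path leading to it.
        for concept in node:
            yield path, concept


def build_queries(taxonomy) -> List[str]:
    """Turn each leaf concept into a search query tagged with its category path."""
    return [
        f"{concept} ({' > '.join(path)})"
        for path, concept in iter_concepts(taxonomy)
    ]


if __name__ == "__main__":
    for query in build_queries(TAXONOMY):
        print(query)
```

Keeping the category path next to each query makes it easy to check coverage per domain afterwards, which is one plausible way to monitor visual diversity during collection.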

Problem

Research questions and friction points this paper is trying to address.

multimodal reasoning

visual diversity

benchmark

Multimodal Large Language Models

open-ended visual inputs

Innovation

Methods, ideas, or system contributions that make the work stand out.

WorldBench

visual diversity

multimodal reasoning