GuessBench: Sensemaking Multimodal Creativity in the Wild

📅 2025-06-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language models (VLMs) inadequately capture human creativity in realistic, noisy, and multicultural contexts. Method: We introduce GuessBench, a benchmark for creative understanding built from 1,500 gameplay images and 2,000 static and dynamic reasoning problems derived from Minecraft’s “Guess the Build” minigame. We formalize open-world creative collaboration as a VLM evaluation task and propose a cross-modal, fine-grained assessment protocol grounded in human-annotated creative reasoning trajectories. Contribution/Results: Model performance is strongly influenced by concept frequency and cultural representativeness in training data; GPT-4o achieves only 66% accuracy (a 34% error rate), while open-source VLMs trail API-based VLMs by 40.06% in average accuracy (13.87% vs. 53.93%). Fine-tuning on reasoning trajectories improves visual perception accuracy by 15.36% on average. We identify cultural and linguistic representation biases as key drivers of performance degradation. Code and data are publicly released.

📝 Abstract
We propose GuessBench, a novel benchmark that evaluates Vision Language Models (VLMs) on modeling pervasive, noisy, and pluralistic human creativity. GuessBench sources data from "Guess the Build", an online multiplayer Minecraft minigame where one player constructs a Minecraft build given a concept (e.g. caterpillar) and others try to guess it with natural language hints, presenting a pristine testbed for sensemaking creativity in the wild with VLMs acting as guessers. We curate 1,500 images from actual gameplay and design 2,000 problems spanning static and dynamic image settings, natural language hints of varying completeness, and more. Extensive experiments with six open/API VLMs and five reasoning enhancement approaches demonstrate that GuessBench presents a uniquely challenging task in creativity modeling: even the state-of-the-art GPT-4o is incorrect on 34% of instances, while we observe a huge performance gap (13.87% vs. 53.93% on average) between open and API models. When used as a resource to improve VLMs, fine-tuning on the reasoning traces for GuessBench problems improves visual perception tasks by 15.36% on average. Further analysis reveals that VLM performance in creativity sensemaking correlates with the frequency of the concept in training data, while accuracy drops sharply for concepts in underrepresented cultural contexts and low-resource languages.
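The evaluation setup described above (a model guesses the build's concept from a gameplay image plus a partially masked natural-language hint, scored against the ground-truth concept) can be sketched as follows. This is a minimal illustration, not the paper's released code: `stub_guesser`, the field names, and the exact-match scoring rule are all hypothetical stand-ins for the real VLM call and protocol.

```python
# Minimal sketch of guesser-style evaluation, assuming exact-match scoring.
# `stub_guesser` is a hypothetical stand-in for a real VLM query.

def stub_guesser(image_path: str, hint: str) -> str:
    """Return a guess for the build concept (stubbed; a real guesser
    would send the image and the hint to a VLM)."""
    canned = {"c_t_rp_ll_r": "caterpillar", "i_l_o": "worm"}
    return canned.get(hint, "")

def accuracy(problems, guesser) -> float:
    """Fraction of problems where the guess exactly matches the concept."""
    correct = sum(
        guesser(p["image"], p["hint"]).strip().lower() == p["concept"].lower()
        for p in problems
    )
    return correct / len(problems)

problems = [
    {"image": "build_001.png", "hint": "c_t_rp_ll_r", "concept": "caterpillar"},
    {"image": "build_002.png", "hint": "i_l_o", "concept": "igloo"},
]
print(accuracy(problems, stub_guesser))  # 0.5: one of two guesses is correct
```

Swapping `stub_guesser` for an actual VLM call turns this into the benchmark's guesser loop; varying how much of the hint is masked models the "hints of varying completeness" dimension.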
Problem

Research questions and friction points this paper is trying to address.

Evaluates VLMs on modeling noisy human creativity
Tests VLMs as guessers in Minecraft gameplay scenarios
Reveals performance gaps in cultural and linguistic contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for Vision Language Model creative understanding
Uses Minecraft gameplay data for evaluation
Improves models via fine-tuning on reasoning traces