Can you SPLICE it together? A Human Curated Benchmark for Probing Visual Reasoning in VLMs

📅 2025-09-29
🤖 AI Summary
This study investigates the limitations of vision-language models (VLMs) in event-level visual reasoning, specifically temporal ordering, causal inference, spatial reasoning, contextual understanding, and commonsense reasoning. It introduces SPLICE, a human-curated benchmark built on the COIN instructional video dataset: 3,381 human-filtered videos segmented into 11,423 event clips, which human participants and leading VLMs must rearrange into coherent event sequences, with and without human-annotated textual descriptions. The results show that VLMs fall well short of human performance. Textual descriptions improve model accuracy but leave human performance unchanged, which suggests that the models rely on language priors more than on visual understanding. VLMs perform relatively better on everyday tasks and on videos dominated by temporal and causal reasoning than on specialized tasks and on videos dominated by contextual and spatial reasoning.

📝 Abstract
In this work, we introduce SPLICE, a human-curated benchmark derived from the COIN instructional video dataset, designed to probe event-based reasoning across multiple dimensions: temporal, causal, spatial, contextual, and general knowledge. SPLICE includes 3,381 human-filtered videos spanning 12 categories and 180 sub-categories, such as sports, engineering, and housework. These videos are segmented into a total of 11,423 event clips. We evaluate both human participants and state-of-the-art vision-language models (VLMs) on the task of rearranging these clips into coherent event sequences to assess visual reasoning capabilities. Results reveal a significant gap: VLMs struggle to match human performance. While human-annotated textual descriptions improve model accuracy, they do not affect human performance, suggesting that models rely more on language priors than on visual understanding. Even with annotations, VLMs fall short of human-level reasoning, underscoring persistent challenges in visual reasoning. A deeper analysis across sub-categories shows that VLMs perform relatively better on videos where temporal and causal reasoning are dominant, compared to those where contextual and spatial reasoning are dominant. They also perform better on everyday tasks than on specialized ones.
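The clip-rearrangement task described above can be scored in a few lines. The following is a minimal sketch, not the authors' evaluation code: the abstract does not state which metrics SPLICE uses, so exact-match accuracy and position-wise accuracy are assumptions chosen for illustration, and the function names `position_accuracy` and `score_orderings` are hypothetical.

```python
from typing import Dict, List, Sequence


def position_accuracy(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of clips placed at their correct position (assumed metric)."""
    # A valid answer must be a permutation of the reference clip indices.
    if sorted(predicted) != sorted(reference):
        raise ValueError("predicted order must be a permutation of the reference")
    if not reference:
        raise ValueError("a video needs at least one clip")
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


def score_orderings(
    predictions: List[Sequence[int]], references: List[Sequence[int]]
) -> Dict[str, float]:
    """Average two assumed metrics over a set of videos."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need one prediction per reference, and at least one video")
    per_video = [position_accuracy(p, r) for p, r in zip(predictions, references)]
    # Exact match: the whole sequence is correct, i.e. position accuracy is 1.0.
    exact = sum(acc == 1.0 for acc in per_video) / len(per_video)
    return {"exact_match": exact, "position_accuracy": sum(per_video) / len(per_video)}


if __name__ == "__main__":
    # Two hypothetical videos: the first is ordered correctly,
    # the second has its first two clips swapped.
    refs = [[0, 1, 2], [0, 1, 2, 3]]
    preds = [[0, 1, 2], [1, 0, 2, 3]]
    print(score_orderings(preds, refs))
```

Exact match is the stricter of the two, since one misplaced clip fails the whole video. Position-wise accuracy gives partial credit and is easier to compare across videos with different numbers of clips.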
Problem

Research questions and friction points this paper is trying to address.

Evaluating visual reasoning in VLMs using event sequence rearrangement tasks
Assessing VLMs' performance across temporal, causal, spatial, and contextual reasoning
Identifying gaps between human and model capabilities in visual understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-curated benchmark for visual reasoning evaluation
Segmented video clips for event sequence rearrangement
Assessment of VLMs across multiple reasoning dimensions
Mohamad Ballout
PhD in Cognitive Science, University of Osnabrück
Computer Vision, Deep Learning, Cognitive Science
Okajevo Wilfred
Institute of Cognitive Science, Osnabrück University, Osnabrück, Germany
Seyedalireza Yaghoubi
Institute of Cognitive Science, Osnabrück University, Osnabrück, Germany
Nohayr Muhammad Abdelmoneim
Institute of Cognitive Science, Osnabrück University, Osnabrück, Germany
Julius Mayer
Natural Language Processing, Osnabrück University
Reinforcement Learning, World Models, Language Emergence
Elia Bruni
University of Osnabrück
Natural Language Processing, Computational Dialogue Modelling, Computer Vision, Machine Learning