Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models struggle to combine grounded reasoning, temporal consistency, and context-aware planning in video understanding. To address this, the work introduces pause-and-think-T, a reasoning-centric training dataset, and pause-and-think-B, a corresponding evaluation benchmark. Together they encourage models to pause, reason in a structured way over visual evidence, and produce concise, actionable responses. With targeted reasoning supervision, a compact 4B-parameter model reaches 58.0% accuracy on the proposed benchmark, close to Qwen3-VL-235B (58.9%) with 59x fewer parameters, while matching GPT-5.2 on scene understanding and surpassing GPT-4o. It also shows strong out-of-distribution performance on EgoThink and TempCompass without benchmark-specific training.

📝 Abstract

Recent Vision-Language Models (VLMs) struggle with grounded reasoning, temporal consistency, and context-aware planning in videos. We introduce pause-and-think-T, a reasoning-centric training dataset that encourages models to pause, reason over visual evidence, and produce concise, actionable responses. The dataset promotes structured reasoning prior to answer generation, guiding models toward human-like, scene-grounded assistance. We fine-tune a compact 4B-parameter model and evaluate it on our pause-and-think-B benchmark targeting contextual understanding and goal planning tasks. The model achieves 58.0% accuracy with 59x fewer parameters than Qwen3-VL-235B (58.9%), matching GPT-5.2 on scene understanding and surpassing GPT-4o. Beyond our benchmark, it also shows strong out-of-distribution performance on EgoThink and TempCompass, with substantial gains in affordance, assistance, attribution recognition, situated reasoning, and temporal order, without benchmark-specific training. Our results indicate that targeted reasoning supervision enables compact models to deliver actionable, visually grounded guidance while generalizing beyond training data, without requiring large-scale model expansion.
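
As a quick sanity check on the headline comparison, the arithmetic behind the "59x fewer parameters" figure can be sketched as follows. This is only an illustration: it assumes the nominal parameter counts implied by the model names (4B and 235B), which may differ from the exact counts, and the helper name `parameter_ratio` is hypothetical. The accuracy values are the ones quoted in the abstract.

```python
# Nominal sizes in billions of parameters (assumed from the model names,
# not exact counts reported by the authors).
COMPACT_PARAMS_B = 4.0
LARGE_PARAMS_B = 235.0

# Accuracies on pause-and-think-B as quoted in the abstract (percent).
COMPACT_ACC = 58.0
LARGE_ACC = 58.9


def parameter_ratio(large_b: float, small_b: float) -> float:
    """Return how many times larger the big model is than the small one."""
    return large_b / small_b


ratio = parameter_ratio(LARGE_PARAMS_B, COMPACT_PARAMS_B)
# Round to one decimal to avoid floating-point noise in the difference.
accuracy_gap = round(LARGE_ACC - COMPACT_ACC, 1)

print(f"Size ratio: about {round(ratio)}x")
print(f"Accuracy gap: {accuracy_gap} percentage points")
```

Under these assumptions the ratio is 235 / 4 = 58.75, which rounds to the reported 59x, and the gap to the larger model is 0.9 percentage points.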

Problem

Research questions and friction points this paper is trying to address.

grounded reasoning

temporal consistency

context-aware planning

video understanding

assistive action suggestion

Innovation

Methods, ideas, or system contributions that make the work stand out.

grounded reasoning

structured reasoning

video-language models