Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

📅 2026-06-03
🤖 AI Summary
This work addresses the absence of large-scale, cross-domain benchmarks for evaluating proactive, real-time multimodal procedural guidance when users deviate from the expected step sequence. To bridge this gap, the authors propose a decoupled planner–interaction architecture that combines procedural state tracking with visual cues to detect off-plan actions and inject recovery instructions. Key contributions include the release of the EgoProactive dataset, the Pro²Bench benchmark built by augmenting five established benchmarks under a unified proactive-guidance schema, the decoupled architecture itself, and a post-training recipe that transfers across model families. The trained system substantially outperforms strong proprietary and open-weight baselines across all six datasets, and oracle-plan experiments show large gains on Out-of-Plan recovery when plan quality is controlled.
📝 Abstract
We envision a proactive multi-modal assistant system that gives users real-time, step-by-step guidance on a procedural task, autonomously deciding when to interrupt and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro²Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner–interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and open-weight baselines (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.
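To make the idea of Out-of-Plan detection and recovery injection concrete, here is a minimal Python sketch. It is a rule-based toy, not the authors' learned system: the `PlanTracker` class, its `observe` method, the exact-string matching of actions, and the message wording are all hypothetical choices made for illustration. It only shows the control flow the abstract describes: track the current step, stay silent when nothing needs saying, and issue a recovery instruction when the user skips ahead.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PlanTracker:
    """Toy stand-in for procedural state tracking (hypothetical, rule-based)."""
    plan: List[str]      # ordered steps, assumed to be given by a planner
    next_idx: int = 0    # index of the step the user is expected to do next

    def observe(self, action: str) -> Optional[str]:
        """Return an instruction to deliver, or None to stay silent."""
        if self.next_idx >= len(self.plan):
            return None  # plan already finished
        expected = self.plan[self.next_idx]
        if action == expected:
            # On-plan: advance the state and announce what comes next.
            self.next_idx += 1
            if self.next_idx < len(self.plan):
                return f"Next: {self.plan[self.next_idx]}"
            return "Done."
        if action in self.plan[self.next_idx + 1:]:
            # Out-of-Plan: the user jumped to a later step, so inject recovery.
            return f"Recover: go back and {expected}"
        # Unrelated action: this toy policy chooses not to interrupt.
        return None


tracker = PlanTracker(["crack eggs", "whisk eggs", "pour into pan"])
for seen in ["crack eggs", "pour into pan", "check phone", "whisk eggs"]:
    print(seen, "->", tracker.observe(seen))
```

In a real system the exact-match comparison would be replaced by a model that infers the current step from egocentric video, and the decision to interrupt would be learned rather than hard-coded.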
Problem

Research questions and friction points this paper is trying to address.

proactive procedural assistance
Out-of-Plan recovery
multi-modal assistant
benchmark dataset
step deviation
Innovation

Methods, ideas, or system contributions that make the work stand out.

proactive procedural assistance
Out-of-Plan recovery
decoupled planner-interaction architecture
multimodal benchmark
post-training transfer