Score the Steps, Not Just the Goal: VLM-Based Subgoal Evaluation for Robotic Manipulation

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current robot learning research predominantly relies on a single binary success rate (SR) metric, which fails to reveal fine-grained performance across intermediate subgoals (e.g., grasping, pouring) in multi-step manipulation tasks. To address this, we propose StepEval, a blueprint for an automated, subgoal-level evaluation framework tailored to robotic manipulation. Built upon vision-language models (VLMs), StepEval accepts single- or multi-view images or videos, and features a lightweight, modular, model-agnostic architecture enabling plug-and-play deployment. It outputs a per-step subgoal success vector, explicitly capturing partial success. Its cost-aware design balances evaluation accuracy and efficiency, requires no new benchmarks or external APIs, and supports diagnostic analysis of latency and computational overhead when ground-truth annotations are available. StepEval advances reproducible, transparent, and fine-grained cross-laboratory evaluation standards for robot learning.

📝 Abstract
Robot learning papers typically report a single binary success rate (SR), which obscures where a policy succeeds or fails along a multi-step manipulation task. We argue that subgoal-level reporting should become routine: for each trajectory, a vector of per-subgoal SRs that makes partial competence visible (e.g., grasp vs. pour). We propose a blueprint for StepEval, a cost-aware plug-in evaluation framework that utilizes vision-language models (VLMs) as automated judges of subgoal outcomes from recorded images or videos. Rather than proposing new benchmarks or APIs, our contribution is to outline design principles for a scalable, community-driven open-source project. In StepEval, the primary artifact for policy evaluation is the per-subgoal SR vector; however, other quantities (e.g., latency or cost estimates) are also considered for framework-optimization diagnostics to help the community tune evaluation efficiency and accuracy when ground-truth subgoal success labels are available. We discuss how such a framework can remain model-agnostic, support single- or multi-view inputs, and be lightweight enough to adopt across labs. The intended contribution is a shared direction: a minimal, extensible seed that invites open-source contributions, so that scoring the steps, not just the final goal, becomes a standard and reproducible practice.
Problem

Research questions and friction points this paper is trying to address.

Current robot learning evaluations obscure where policies fail in multi-step manipulation tasks
Lack of a standardized subgoal-level report that makes partial competence visible in robotics
Need for a scalable automated evaluation framework that uses vision-language models as judges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses vision-language models as automated subgoal judges
Proposes cost-aware evaluation framework with per-subgoal metrics
Designs model-agnostic lightweight framework for multi-view inputs
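The paper's primary artifact, a per-subgoal success-rate vector, can be illustrated with a minimal sketch (not the authors' code). The VLM judge is stubbed out here as precomputed per-trajectory boolean vectors; the subgoal names and function are hypothetical, introduced only to show how per-subgoal SRs expose partial competence that a single final SR hides.

```python
from typing import Dict, List

def per_subgoal_sr(judgments: List[List[bool]], subgoals: List[str]) -> Dict[str, float]:
    """Average each subgoal's binary outcomes across recorded trajectories."""
    n = len(judgments)
    return {
        name: sum(traj[i] for traj in judgments) / n
        for i, name in enumerate(subgoals)
    }

# Example: four trajectories of a grasp -> pour task. A single final SR
# would report 0.25, hiding that grasping already succeeds 75% of the time.
subgoals = ["grasp", "pour"]
judgments = [
    [True, True],    # full success
    [True, False],   # grasped, failed to pour
    [True, False],
    [False, False],  # failed at the first step
]
print(per_subgoal_sr(judgments, subgoals))  # {'grasp': 0.75, 'pour': 0.25}
```

In the framework as described, the boolean vectors would come from a VLM judging recorded images or videos per subgoal; the aggregation step above stays the same regardless of which model produces the judgments.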
Ramy ElMallah
Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Canada
Krish Chhajer
Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada
Chi-Guhn Lee
University of Toronto
Operations Research · Markov Decision Processes · Reinforcement Learning