VICX: Generalizable Robot Manipulation via Video Generation and In-Context Operator Network

📅 2026-06-10

🤖 AI Summary

This work proposes VICX, a framework for generalizable robot manipulation in unseen environments that decouples high-level task reasoning from low-level execution. A frozen video generation model produces task-level visual plans, and a task-agnostic V2T-ICON network maps these plans into executable robot-state trajectories. V2T-ICON takes segmentation-extracted, arm-only frames as input and uses retrieved image-state pairs as in-context prompts, so the visual-to-state mapping adapts at inference time without parameter updates. On the Meta-World benchmark, VICX is reported to support cross-task generalization, cross-embodiment transfer, and closed-loop self-correction, which the authors describe as dual generalization across task semantics and robot execution.

📝 Abstract

Generalizable robot manipulation requires not only task-level reasoning over unseen scenes, but also reliable grounding of visual plans into embodiment-specific execution. To bridge this gap, we propose VICX (Video generation and In-Context eXecution), a decoupled closed-loop manipulation framework. In VICX, a frozen video generation model produces vision-language-conditioned high-level visual plans, while a Video-to-Trajectory In-Context Operator Network (V2T-ICON) serves as the task-agnostic interface that grounds these plans into executable robot-state trajectories. To improve execution generalization, V2T-ICON operates on segmentation-extracted arm-only frame observations and uses retrieved image-state pairs as in-context prompts, allowing a robust and generalizable visual-to-state mapping at inference time without parameter updates. Experiments on Meta-World show that VICX supports cross-task generalization, closed-loop self-correction, and cross-embodiment transfer, demonstrating dual generalization across both task semantics and robot execution. The project webpage can be found here: https://scaling-group.github.io/vicx/.
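The retrieval-based prompting idea in the abstract can be illustrated with a small sketch. This is not the authors' V2T-ICON implementation: the function names (`retrieve_prompts`, `predict_state`), the use of cosine similarity over precomputed frame features, and the softmax-weighted average that stands in for the learned operator network are all assumptions made for illustration. It only shows how retrieved image-state pairs can condition a visual-to-state mapping at inference time without any parameter update.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero for all-zero feature vectors


def _normalize(x, axis=-1):
    """L2-normalize feature vectors along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + EPS)


def retrieve_prompts(query_feat, bank_feats, bank_states, k=3):
    """Return the k image-state pairs most similar to the query frame.

    query_feat:  (d,) feature of an arm-only frame (hypothetical encoder output)
    bank_feats:  (n, d) features of stored frames
    bank_states: (n, s) robot states paired with those frames
    """
    sims = _normalize(bank_feats) @ _normalize(query_feat)
    top = np.argsort(-sims)[:k]  # indices of the k highest similarities
    return bank_feats[top], bank_states[top]


def predict_state(query_feat, prompt_feats, prompt_states, temperature=0.1):
    """Stand-in for the learned operator (assumption, not V2T-ICON):
    a softmax-weighted average of the prompt states, weighted by
    cosine similarity between the query and each prompt frame."""
    logits = (_normalize(prompt_feats) @ _normalize(query_feat)) / temperature
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ prompt_states
```

In this sketch the "model" has no trainable parameters at all, which makes the point in its simplest form: swapping the bank of image-state pairs (for example, pairs collected on a different arm) changes the predicted states without retraining. A learned in-context operator would replace `predict_state` while keeping the same retrieve-then-condition interface.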

Problem

Research questions and friction points this paper is trying to address.

generalizable robot manipulation

visual planning

embodiment-specific execution

task-level reasoning

cross-embodiment transfer

Innovation

Methods, ideas, or system contributions that make the work stand out.

video generation

in-context learning

robot manipulation