🤖 AI Summary
To address the limited generality and robustness of robotic manipulation in unstructured desktop environments, this paper proposes GVF-TAPE, an end-to-end closed-loop visual planning framework that integrates generative visual foresight (predicting future RGB-D frames) with task-agnostic, decoupled end-effector pose estimation. Unlike prior approaches, GVF-TAPE operates solely from monocular side-view RGB images and natural-language task descriptions, requiring no task-specific action annotations, and synthesizes executable 6-DoF pose commands in real time for a low-level controller. Evaluated in both simulation and real-world settings, GVF-TAPE demonstrates strong cross-task generalization across diverse manipulation behaviors, including grasping, pushing, and placing, while significantly reducing reliance on task-specific training data. The framework remains robust under environmental uncertainty and scales to unseen tasks without retraining or fine-tuning.
📝 Abstract
Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce GVF-TAPE, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation. GVF-TAPE employs a generative video model to predict future RGB-D frames from a single side-view RGB image and a task description, offering visual plans that guide robot actions. A decoupled pose estimation model then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers. By iteratively integrating video foresight and pose estimation in a closed loop, GVF-TAPE achieves real-time, adaptive manipulation across a broad range of tasks. Extensive experiments in both simulation and real-world settings demonstrate that our approach reduces reliance on task-specific action data and generalizes effectively, providing a practical and scalable solution for intelligent robotic systems.
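The abstract describes a pipeline of three interacting components run in a loop: a generative video model that predicts future RGB-D frames from one side-view image plus a task description, a decoupled pose estimator that maps predicted frames to 6-DoF end-effector poses, and a low-level controller that executes those poses. A minimal sketch of that control loop is below; all function names, signatures, and data shapes are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch of the GVF-TAPE closed loop described in the abstract.
# The callables stand in for the paper's components: `foresee` is the
# generative video model, `estimate_pose` the decoupled pose estimator,
# and `execute` the low-level controller. None of these names come from
# the paper; they are placeholders for illustration only.

from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# 6-DoF end-effector pose: x, y, z, roll, pitch, yaw
Pose6DoF = Tuple[float, float, float, float, float, float]


@dataclass
class Frame:
    """Placeholder for one (predicted or observed) RGB-D frame."""
    rgb: object
    depth: object


def gvf_tape_loop(
    observe: Callable[[], Frame],                      # camera: current side-view frame
    foresee: Callable[[Frame, str], Sequence[Frame]],  # generative visual foresight
    estimate_pose: Callable[[Frame], Pose6DoF],        # task-agnostic pose estimation
    execute: Callable[[Pose6DoF], None],               # low-level controller
    task: str,
    max_steps: int = 10,
) -> List[Pose6DoF]:
    """Iterate foresight -> pose extraction -> execution in a closed loop."""
    executed: List[Pose6DoF] = []
    for _ in range(max_steps):
        frame = observe()              # re-observe each cycle (closes the loop)
        plan = foresee(frame, task)    # predicted future RGB-D frames
        if not plan:
            break                      # no further visual plan: task presumed done
        pose = estimate_pose(plan[0])  # act on the nearest predicted frame
        execute(pose)
        executed.append(pose)
    return executed
```

The key structural point the sketch captures is the decoupling: the foresight model and the pose estimator are independent modules exchanging only frames and poses, which is what lets the pose estimator remain task-agnostic while the video model carries the task conditioning.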