Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-Top Manipulation

📅 2025-08-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limited generality and robustness of robotic manipulation in unstructured table-top environments, this paper proposes GVF-TAPE, an end-to-end closed-loop visual planning framework that integrates generative visual foresight (prediction of future RGB-D frames) with task-agnostic, decoupled end-effector pose estimation. Unlike prior approaches, GVF-TAPE operates solely from monocular side-view images and natural-language task descriptions, requiring no task-specific action annotations, and directly synthesizes executable 6-DoF pose commands in real time for low-level controller execution. Evaluated in both simulation and physical settings, GVF-TAPE demonstrates strong cross-task generalization across diverse manipulation behaviors, including grasping, pushing, and placing, while substantially reducing reliance on task-customized training data. The framework remains robust under environmental uncertainty and scales to unseen tasks without retraining or fine-tuning.
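
In code terms, the closed loop the summary describes could be sketched as below. This is a minimal, hypothetical sketch: the class and method names (VideoForesightModel, PoseEstimator, robot.move_to_pose, and so on) are placeholder assumptions for illustration, not the authors' published interface.

```python
# Minimal sketch of the GVF-TAPE closed loop (all interfaces hypothetical).
import numpy as np


class VideoForesightModel:
    """Stand-in for the generative video model: given a side-view RGB
    image and a language task description, predicts future RGB-D frames."""

    def predict(self, rgb: np.ndarray, instruction: str) -> list[np.ndarray]:
        raise NotImplementedError  # e.g., a learned video generator


class PoseEstimator:
    """Stand-in for the decoupled, task-agnostic pose model: maps a
    predicted RGB-D frame to a 6-DoF end-effector pose."""

    def estimate(self, rgbd: np.ndarray) -> np.ndarray:
        raise NotImplementedError  # returns [x, y, z, roll, pitch, yaw]


def run_closed_loop(robot, camera, foresight, pose_model, instruction,
                    max_replans: int = 50) -> None:
    """Observe -> imagine future frames -> extract poses -> execute, repeat."""
    for _ in range(max_replans):
        rgb = camera.capture()                      # single side-view RGB image
        plan = foresight.predict(rgb, instruction)  # visual plan as RGB-D frames
        for frame in plan:
            pose = pose_model.estimate(frame)       # 6-DoF pose command
            robot.move_to_pose(pose)                # low-level controller executes
        if robot.task_done(instruction):            # close the loop: replan until success
            break
```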

📝 Abstract
Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce GVF-TAPE, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation. GVF-TAPE employs a generative video model to predict future RGB-D frames from a single side-view RGB image and a task description, offering visual plans that guide robot actions. A decoupled pose estimation model then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers. By iteratively integrating video foresight and pose estimation in a closed loop, GVF-TAPE achieves real-time, adaptive manipulation across a broad range of tasks. Extensive experiments in both simulation and real-world settings demonstrate that our approach reduces reliance on task-specific action data and generalizes effectively, providing a practical and scalable solution for intelligent robotic systems.
Problem

Research questions and friction points this paper is trying to address.

Generalizing robotic manipulation across diverse unstructured environments
Reducing reliance on task-specific action data for robots
Combining visual foresight with pose estimation for scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative video model predicts future RGB-D frames
Decoupled pose estimation extracts end-effector poses
Closed-loop integration enables real-time adaptive manipulation
Chuye Zhang
Southern University of Science and Technology
Xiaoxiong Zhang
Southern University of Science and Technology
Wei Pan
Southern University of Science and Technology
Linfang Zheng
Postdoctoral Fellow at The University of Hong Kong
Computer Vision · Robotics
Wei Zhang
Southern University of Science and Technology, LimX Dynamics