VIP: Vision Instructed Pre-training for Robotic Manipulation

📅 2024-10-09

🤖 AI Summary
Problem: In robotic manipulation, task diversity causes policy confusion; natural-language instructions are poorly grounded in robot data, while a single future frame fails to capture how the target state changes over time. Method: The paper replaces text-based task specifications with visual instructions as the core task modality and introduces sparse point-flow encoding to model fine-grained inter-frame object dynamics. The end-to-end vision-guided pretraining framework integrates sparse point-flow representations, a cross-frame action prediction network, and real-sim co-pretraining. Contribution/Results: The method significantly improves performance across diverse tasks. The resulting policy completes high-difficulty embodied manipulation tasks such as “opening the lid of a tightly sealed bottle”, and shows that manipulation policies can be pretrained and deployed end to end from visual instructions alone.

📝 Abstract
The effectiveness of scaling up training data in robotic manipulation is still limited. A primary challenge is that manipulation tasks are diverse, and a trained policy becomes confused if the task targets are not specified clearly. Existing works primarily rely on text instructions to describe targets. However, we show that current robotic data cannot train policies to understand text instructions effectively, and that vision is much more comprehensible. Therefore, we propose using vision instructions to specify targets. A straightforward implementation is to train a policy to predict the intermediate actions linking the current observation and a future image. Nevertheless, a single future image does not describe the task target in sufficient detail. To handle this problem, we propose using sparse point flows to provide more detailed information. Extensive tasks are designed in real and simulated environments to evaluate the effectiveness of our vision instructed pre-training (VIP) method. The results indicate that VIP significantly improves performance on diverse tasks, and the derived policy can complete challenging tasks like “opening the lid of a tightly sealed bottle”.
Problem

Research questions and friction points this paper is trying to address.

Improve robotic manipulation with vision instructions
Address task diversity in robotic policies
Enhance target specification using sparse point flows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision instructed pre-training
Sparse point flows
Intermediate action prediction
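
The three ideas listed above can be made concrete with a small data-preparation sketch. It is not the authors' implementation: the names `VisionInstructedSample` and `make_sample`, the fixed `horizon`, and the choice to store each point flow as a (start, end) position pair are all assumptions made for illustration. The sketch only shows how one training example could pair a current observation with a future frame (the vision instruction), the intermediate actions linking them, and a few sparse point flows.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class VisionInstructedSample:
    """Hypothetical container for one pre-training example."""
    obs_index: int                          # frame used as the current observation
    goal_index: int                         # future frame used as the vision instruction
    actions: List[Sequence[float]]          # intermediate actions linking the two frames
    point_flow: List[Tuple[Point, Point]]   # (start, end) positions of sparse tracked points


def make_sample(actions, tracks, horizon, num_points, rng):
    """Build one example from a recorded trajectory.

    actions[t] is the action taken at frame t (moving the scene to frame t + 1).
    tracks[p][t] is the (x, y) position of tracked point p at frame t.
    Both layouts are assumptions of this sketch.
    """
    if not 1 <= horizon <= len(actions):
        raise ValueError("horizon must be between 1 and the number of actions")
    if num_points > len(tracks):
        raise ValueError("not enough tracked points")

    num_frames = len(actions) + 1
    t = rng.randrange(num_frames - horizon)   # current observation index
    g = t + horizon                           # future (goal) frame index

    # Keep only a sparse subset of the tracked points.
    chosen = rng.sample(range(len(tracks)), num_points)
    flow = [(tracks[p][t], tracks[p][g]) for p in chosen]

    # The policy's regression target is the action sequence between t and g.
    return VisionInstructedSample(t, g, list(actions[t:g]), flow)


if __name__ == "__main__":
    # Toy trajectory: 5 one-dimensional actions, 6 frames, 4 tracked points.
    toy_actions = [[float(i)] for i in range(5)]
    toy_tracks = [[(float(p), float(t)) for t in range(6)] for p in range(4)]
    sample = make_sample(toy_actions, toy_tracks, horizon=3, num_points=2,
                         rng=random.Random(0))
    print(sample)
```

In a setup like this, the future frame and the point flows would be the inputs that specify the target, and the sliced action sequence would be the supervision signal; how the paper actually encodes and fuses these inputs is not described in this summary.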
Authors
Zhuoling Li, HKU
Liangliang Ren, CVTE
Jinrong Yang, CVTE
Yong Zhao, CVTE
Xiaoyang Wu, HKU
Zhenhua Xu, HKU
Xiang Bai, Huazhong University of Science and Technology (HUST). Research areas: Computer Vision, OCR
Hengshuang Zhao, The University of Hong Kong. Research areas: Computer Vision, Machine Learning, Artificial Intelligence