Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dual-arm cooperative manipulation poses significant challenges for generalization from single-arm vision-language-action (VLA) models due to high-dimensional action spaces, intricate inter-arm coordination requirements, and scarcity of real-world demonstration data. To address this, we propose a novel optical-flow-guided text-to-video generation paradigm, introducing the first “text → optical flow → video” two-stage decomposition architecture. Optical flow serves as a differentiable, motion-explicit intermediate representation that decouples language intent understanding from physical motion modeling, thereby substantially improving action-semantic alignment accuracy. Crucially, our method eliminates reliance on large-scale dual-arm demonstration datasets; instead, it achieves effective fine-tuning using only a small number of simulated or real-robot trajectories. Integrating diffusion-based policy networks, optical flow prediction, and text-to-video generation, our approach is rigorously validated on both simulation and real-world dual-arm robotic platforms, demonstrating strong generalization capability, high inter-arm coordination fidelity, and exceptional data efficiency.

📝 Abstract
Learning a generalizable bimanual manipulation policy is extremely challenging for embodied agents due to the large action space and the need for coordinated arm movements. Existing approaches rely on Vision-Language-Action (VLA) models to acquire bimanual policies. However, transferring knowledge from single-arm datasets or pre-trained VLA models often fails to generalize effectively, primarily due to the scarcity of bimanual data and the fundamental differences between single-arm and bimanual manipulation. In this paper, we propose a novel bimanual foundation policy by fine-tuning the leading text-to-video models to predict robot trajectories and training a lightweight diffusion policy for action generation. Given the lack of embodied knowledge in text-to-video models, we introduce a two-stage paradigm that fine-tunes independent text-to-flow and flow-to-video models derived from a pre-trained text-to-video model. Specifically, optical flow serves as an intermediate variable, providing a concise representation of subtle movements between images. The text-to-flow model predicts optical flow to concretize the intent of language instructions, and the flow-to-video model leverages this flow for fine-grained video prediction. Our method mitigates the ambiguity of language in single-stage text-to-video prediction and significantly reduces the robot-data requirement by avoiding direct use of low-level actions. In experiments, we collect high-quality manipulation data for a real dual-arm robot, and the results of simulation and real-world experiments demonstrate the effectiveness of our method.
Problem

Research questions and friction points this paper is trying to address.

Learning generalizable bimanual manipulation policies for embodied agents
Overcoming limitations of single-arm datasets and VLA models
Reducing robot-data requirements via flow-based video prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes text-to-video models for robot trajectory prediction
Uses optical flow as intermediate movement representation
Trains lightweight diffusion policy for action generation
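The two-stage pipeline above can be sketched as three composed modules. This is a minimal structural sketch, not the paper's implementation: all function names, tensor shapes, the planning horizon, and the 7-DoF-per-arm assumption are hypothetical, and the model bodies are placeholders standing in for the fine-tuned networks.

```python
import numpy as np

# Hypothetical stand-ins for the paper's fine-tuned models.
# Assumed shapes: flow is (T, H, W, 2) per-pixel displacement,
# video is (T, H, W, 3), actions are (T, joints).

def text_to_flow(instruction: str, first_frame: np.ndarray, horizon: int = 8) -> np.ndarray:
    """Stage 1: predict an optical-flow sequence that concretizes the
    language instruction (placeholder: zero motion)."""
    h, w, _ = first_frame.shape
    return np.zeros((horizon, h, w, 2), dtype=np.float32)

def flow_to_video(first_frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Stage 2: generate future frames conditioned on the predicted flow
    (placeholder: repeat the first frame instead of warping it)."""
    return np.repeat(first_frame[None], flow.shape[0], axis=0)

def diffusion_policy(video: np.ndarray, n_joints: int = 14) -> np.ndarray:
    """Lightweight policy head: map predicted frames to bimanual joint
    actions (placeholder: zeros; 7 DoF per arm assumed)."""
    return np.zeros((video.shape[0], n_joints), dtype=np.float32)

# End-to-end rollout on a dummy observation.
frame = np.zeros((64, 64, 3), dtype=np.uint8)
flow = text_to_flow("fold the towel with both arms", frame)
video = flow_to_video(frame, flow)
actions = diffusion_policy(video)
print(actions.shape)  # (8, 14)
```

The design point illustrated here is the decoupling: only `diffusion_policy` ever touches robot actions, so the two video-side models can be fine-tuned from a pre-trained text-to-video backbone with few robot trajectories.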
Chenyou Fan
Institute of Artificial Intelligence (TeleAI), China Telecom; Northwestern Polytechnical University

Fangzheng Yan
Institute of Artificial Intelligence (TeleAI), China Telecom; Hong Kong University of Science and Technology

Chenjia Bai
Institute of Artificial Intelligence (TeleAI), China Telecom
Reinforcement Learning · Robotics · Embodied AI

Jiepeng Wang
The University of Hong Kong
3D Vision · AIGC · Robotics

Chi Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom

Zhen Wang
Northwestern Polytechnical University

Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom