Graph-Fused Vision-Language-Action for Policy Reasoning in Multi-Arm Robotic Manipulation

📅 2025-09-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generalizing human video demonstrations to dexterous dual-arm robotic manipulation remains challenging due to over-reliance on low-level trajectory imitation, which fails under variations in object identity, spatial layout, and robot configuration. Method: We propose a vision-language-action graph fusion framework that models spatiotemporal interactions via scene graphs, incorporates an information-theoretic mechanism for extracting critical hand-object and object-object relations, and employs a language-conditioned Transformer to generate interpretable, hierarchical behavior trees and atomic action primitives, enabling autonomous cross-arm grasp allocation without explicit geometric modeling. Results: Evaluated on four dual-arm block assembly tasks, our approach achieves >95% graph structural accuracy and 93% subtask segmentation precision; upon deployment it attains 94% grasp reliability, 89% placement accuracy, and a 90% overall task success rate, demonstrating substantial improvements in robustness and cross-scenario generalization.

📝 Abstract
Acquiring dexterous robotic skills from human video demonstrations remains a significant challenge, largely due to conventional reliance on low-level trajectory replication, which often fails to generalize across varying objects, spatial layouts, and manipulator configurations. To address this limitation, we introduce Graph-Fused Vision-Language-Action (GF-VLA), a unified framework that enables dual-arm robotic systems to perform task-level reasoning and execution directly from RGB-D human demonstrations. GF-VLA employs an information-theoretic approach to extract task-relevant cues, selectively highlighting critical hand-object and object-object interactions. These cues are structured into temporally ordered scene graphs, which are subsequently integrated with a language-conditioned transformer to produce hierarchical behavior trees and interpretable Cartesian motion primitives. To enhance efficiency in bimanual execution, we propose a cross-arm allocation strategy that autonomously determines gripper assignment without requiring explicit geometric modeling. We validate GF-VLA on four dual-arm block assembly benchmarks involving symbolic structure construction and spatial generalization. Empirical results demonstrate that the proposed representation achieves over 95% graph accuracy and 93% subtask segmentation, enabling the language-action planner to generate robust, interpretable task policies. When deployed on a dual-arm robot, these policies attain 94% grasp reliability, 89% placement accuracy, and 90% overall task success across stacking, letter-formation, and geometric reconfiguration tasks, evidencing strong generalization and robustness under diverse spatial and semantic variations.
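The abstract's temporally ordered scene graphs can be pictured as a per-frame set of (subject, predicate, object) relations, with subtask boundaries falling where the relation set changes. A minimal illustrative sketch (all class and relation names are assumptions, not taken from the paper):

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a temporally ordered scene-graph sequence;
# names are illustrative, not the paper's actual API.

@dataclass
class Relation:
    subject: str    # e.g. "left_hand" or "block_A"
    predicate: str  # e.g. "grasps", "on_top_of"
    obj: str

@dataclass
class SceneGraph:
    timestep: int
    relations: list[Relation] = field(default_factory=list)

def segment_subtasks(graphs: list[SceneGraph]) -> list[tuple[int, int]]:
    """Split a demonstration into (start, end) frame spans wherever the
    relation set changes -- a crude proxy for subtask segmentation."""
    spans, start = [], 0
    for i in range(1, len(graphs)):
        prev = {(r.subject, r.predicate, r.obj) for r in graphs[i - 1].relations}
        curr = {(r.subject, r.predicate, r.obj) for r in graphs[i].relations}
        if prev != curr:
            spans.append((start, i - 1))
            start = i
    spans.append((start, len(graphs) - 1))
    return spans

graphs = [
    SceneGraph(0, [Relation("hand", "reaches", "block_A")]),
    SceneGraph(1, [Relation("hand", "grasps", "block_A")]),
    SceneGraph(2, [Relation("hand", "grasps", "block_A")]),
    SceneGraph(3, [Relation("block_A", "on_top_of", "block_B")]),
]
print(segment_subtasks(graphs))  # → [(0, 0), (1, 2), (3, 3)]
```

In the paper the segmented spans then condition a language-conditioned Transformer that emits behavior trees; this sketch only illustrates the graph-side representation.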
Problem

Research questions and friction points this paper is trying to address.

Enabling dual-arm robots to generalize manipulation skills from human video demonstrations
Overcoming limitations of low-level trajectory replication across varying conditions
Generating robust task policies through hierarchical reasoning and scene graphs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-Fused VLA framework for task-level reasoning
Information-theoretic scene graph extraction from RGB-D
Cross-arm allocation strategy without geometric modeling
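The information-theoretic extraction listed above can be read as scoring candidate relations by how much information they carry about the task phase, e.g. via empirical mutual information. A minimal sketch under assumed discrete inputs (the variable names and example data are illustrative only):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two
    discrete, equal-length sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        # p(x,y) * log( p(x,y) / (p(x) p(y)) )
        mi += pxy * math.log(pxy * n * n / (px[x] * py[y]))
    return mi

# Hypothetical example: per-frame presence of a hand-object relation
# (1/0) versus the subtask label; a perfectly aligned relation carries
# ln(2) ≈ 0.693 nats about this two-phase task.
relation_present = [1, 1, 1, 0, 0, 0]
subtask = ["grasp"] * 3 + ["place"] * 3
print(round(mutual_information(relation_present, subtask), 3))  # → 0.693
```

Relations scoring near zero would be treated as task-irrelevant and pruned from the scene graph; this is a generic MI estimator, not the paper's specific mechanism.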
Shunlei Li
The Chinese University of Hong Kong
Robotics · Computer Vision · AI for Science
Longsen Gao
Electrical and Computer Engineering Department, The University of New Mexico, Albuquerque, NM 87131, USA
Jiuwen Cao
Machine Learning and I-health International Cooperation Base of Zhejiang Province, Artificial Intelligence Institute, Hangzhou Dianzi University, Zhejiang, 310018, China
Yingbai Hu
The Chinese University of Hong Kong | Technische Universität München
robot learning · robot control · medical robot