Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generalizing dexterous skill learning for bimanual robots from human demonstration videos, bypassing low-level trajectory imitation to enhance adaptability across diverse objects, spatial configurations, and robotic arm geometries. We propose GF-VLA, a framework that (1) leverages Shannon entropy to identify salient hand–object interactions and construct temporal scene graphs; (2) employs a language-conditioned Transformer to generate interpretable behavior trees and executable motion primitives; and (3) introduces a cross-hand selection strategy to optimize coordinated gripper allocation. The method integrates information-theoretic feature extraction, structured scene modeling, language-guided hierarchical behavior generation, and bimanual closed-loop control. Evaluated on multiple assembly tasks, GF-VLA achieves 95% scene graph accuracy, 93% subtask segmentation precision, 94% grasp success rate, 89% placement accuracy, and 90% overall task completion rate.

📝 Abstract
Teaching robots dexterous skills from human videos remains challenging due to the reliance on low-level trajectory imitation, which fails to generalize across object types, spatial layouts, and manipulator configurations. We propose Graph-Fused Vision-Language-Action (GF-VLA), a framework that enables dual-arm robotic systems to perform task-level reasoning and execution directly from RGB and Depth human demonstrations. GF-VLA first extracts Shannon-information-based cues to identify hands and objects with the highest task relevance, then encodes these cues into temporally ordered scene graphs that capture both hand-object and object-object interactions. These graphs are fused with a language-conditioned transformer that generates hierarchical behavior trees and interpretable Cartesian motion commands. To improve execution efficiency in bimanual settings, we further introduce a cross-hand selection policy that infers optimal gripper assignment without explicit geometric reasoning. We evaluate GF-VLA on four structured dual-arm block assembly tasks involving symbolic shape construction and spatial generalization. Experimental results show that the information-theoretic scene representation achieves over 95 percent graph accuracy and 93 percent subtask segmentation, supporting the LLM planner in generating reliable and human-readable task policies. When executed by the dual-arm robot, these policies yield 94 percent grasp success, 89 percent placement accuracy, and 90 percent overall task success across stacking, letter-building, and geometric reconfiguration scenarios, demonstrating strong generalization and robustness across diverse spatial and semantic variations.
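The abstract's first stage ranks hands and objects by Shannon-information cues before building the scene graph. The paper does not publish code for this step, so the following is only a minimal sketch of the general idea: estimate a histogram over a region's appearance values and score the region by its Shannon entropy, treating higher entropy as a proxy for task relevance. Function names and the toy intensity values are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def shannon_entropy(values, bins=16):
    """Histogram-based Shannon entropy (in bits) of a 1-D sample."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant sample
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Rank candidate regions (e.g. detected hand/object patches) by entropy;
# a visually varied patch scores higher than a near-uniform background.
regions = {
    "hand":  [10, 200, 35, 180, 90, 250, 5, 130],     # varied intensities
    "table": [120, 121, 119, 120, 122, 121, 120, 119],  # near-uniform
}
salient = max(regions, key=lambda k: shannon_entropy(regions[k]))
```

Here `salient` picks out the hand patch, mirroring how high-information cues would be kept as scene-graph nodes while low-information background is discarded.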
Problem

Research questions and friction points this paper is trying to address.

Teaching robots dexterous skills from human videos
Generalizing across object types and spatial layouts
Dual-arm robotic task-level reasoning and execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-Fused Vision-Language-Action for task reasoning
Information-theoretic scene graphs for interaction modeling
Cross-hand selection policy for bimanual efficiency
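The innovations above center on language-generated hierarchical behavior trees whose leaves are executable motion primitives with a per-subtask gripper choice. As a rough, hypothetical sketch of that structure (not the paper's actual planner or node vocabulary), a sequence node can tick action leaves that each carry an arm assignment and a Cartesian goal:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """Leaf node: one motion primitive (names are illustrative)."""
    name: str
    arm: str       # "left" or "right" gripper, as chosen per subtask
    target: tuple  # Cartesian goal (x, y, z)

    def tick(self, execute):
        return execute(self.name, self.arm, self.target)

@dataclass
class Sequence:
    """Composite node: succeeds only if every child succeeds in order."""
    children: list = field(default_factory=list)

    def tick(self, execute):
        return all(child.tick(execute) for child in self.children)

# A toy "stack two blocks" plan with a per-subtask arm assignment.
plan = Sequence([
    Action("grasp", "left",  (0.30, 0.10, 0.02)),
    Action("place", "left",  (0.30, -0.10, 0.06)),
    Action("grasp", "right", (0.45, 0.20, 0.02)),
    Action("place", "right", (0.30, -0.10, 0.10)),
])

log = []
ok = plan.tick(lambda name, arm, tgt: log.append((name, arm, tgt)) or True)
```

The sequence semantics (fail-fast, ordered execution) are what make such trees interpretable and easy to audit, which is the property the abstract highlights for the generated task policies.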
Shunlei Li
The Chinese University of Hong Kong
Robotics · Computer Vision · AI for Science
Longsen Gao
Electrical and Computer Engineering Department, University of New Mexico, Albuquerque, United States, 87106
Jin Wang
Dynamic Robot Systems Group, Oxford Robotics Institute, University of Oxford, United Kingdom, OX26NN
Chang Che
Mechanical and Aerospace Engineering Department, The George Washington University, DC, United States, 22202
Xi Xiao
Oak Ridge National Laboratory | University of Alabama at Birmingham
LLM / MLLM Efficiency · Image / Video Generation · Image / Video Understanding
Jiuwen Cao
Machine Learning and I-health International Cooperation Base of Zhejiang Province, Artificial Intelligence Institute, Hangzhou Dianzi University, Zhejiang, China, 310018
Yingbai Hu
The Chinese University of Hong Kong | Technische Universität München
robot learning · robot control · medical robot
Hamid Reza Karimi
Professor, Politecnico di Milano
Control theory · Applied Mechanics · Mechatronics · Fault Diagnosis · Autonomous Systems