PhysiFlow: Physics-Aware Humanoid Whole-Body VLA via Multi-Brain Latent Flow Matching and Robust Tracking

📅 2026-03-05
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing vision–language–action (VLA) approaches for humanoid robots often execute dynamic tasks unstably because of inefficient reasoning or insufficient semantic guidance for whole-body coordination control. To address this, the work proposes a semantic–motor-intention-guided, physics-aware multi-brain VLA framework that, for the first time, integrates multi-brain latent flow matching with physics-based constraint modeling. By robustly fusing visual, linguistic, and motor signals through intention-aligned tracking, the method achieves efficient, semantically grounded whole-body coordination and markedly improves the stability and reliability of humanoid robots performing dynamic tasks under vision–language guidance.

📝 Abstract
In humanoid robot control, fusing Vision-Language-Action (VLA) models with whole-body control is essential for semantically guided execution of real-world tasks. Existing methods, however, suffer from low VLA inference efficiency or lack effective semantic guidance for whole-body control, leading to instability in dynamic limb-coordination tasks. To bridge this gap, we present a semantic-motion-intent-guided, physics-aware multi-brain VLA framework for humanoid whole-body control. A series of experiments was conducted to evaluate the proposed framework, and the results demonstrate that it enables reliable vision-language-guided whole-body coordination for humanoid robots.
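The abstract names latent flow matching as the generative core of the framework but does not detail it. As background only (this is not the paper's actual architecture, and the linear path, batch shapes, and `predict_velocity` interface below are illustrative assumptions), a minimal NumPy sketch of the standard conditional flow-matching objective — regressing a velocity field along a straight-line path from base noise to target latents — looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(x0, x1, predict_velocity):
    """Conditional flow-matching loss for a linear probability path.

    x0: samples from a base (noise) distribution, shape (B, D)
    x1: target latent action samples, shape (B, D)
    predict_velocity: callable v_theta(x_t, t) -> array of shape (B, D)
    """
    B = x0.shape[0]
    t = rng.uniform(size=(B, 1))            # random time in [0, 1] per sample
    x_t = (1.0 - t) * x0 + t * x1           # point on the straight-line path
    target = x1 - x0                        # exact velocity of that path
    pred = predict_velocity(x_t, t)
    return np.mean((pred - target) ** 2)    # simple MSE regression objective

# Toy check: an oracle that returns the true path velocity yields zero loss.
x0 = rng.normal(size=(64, 8))
x1 = rng.normal(size=(64, 8))
oracle = lambda x_t, t: x1 - x0  # closure is only possible in this toy setting
loss = flow_matching_loss(x0, x1, oracle)
```

In a real system `predict_velocity` would be a learned network conditioned on vision-language features; at inference, actions are generated by integrating the learned velocity field from noise to the latent action space.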
Problem

Research questions and friction points this paper is trying to address.

humanoid robot control
Vision-Language-Action
whole-body control
semantic guidance
dynamic limb coordination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Physics-Aware
Multi-Brain Latent Flow Matching
Vision-Language-Action (VLA)
Whole-Body Control
Semantic-Motion Intent
Weikai Qin
Department of Automation Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
Sichen Wu
Department of Automation Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
Ci Chen
Department of Automation Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
Mengfan Liu
The University of Hong Kong
Linxi Feng
Department of Automation Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
Xinru Cui
Department of Automation Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
Haoqi Han
Department of Automation Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
Hesheng Wang
Department of Automation, Key Laboratory of System Control and Information Processing of Ministry of Education, Key Laboratory of Marine Intelligent Equipment and System of Ministry of Education, Shanghai Engineering Research Center of Intelligent Control and Management, Shanghai Jiao Tong University, Shanghai, 200240, China