Act, Sense, Act: Learning Non-Markovian Active Perception Strategies from Large-Scale Egocentric Human Data

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of robotic manipulation in unstructured environments, where perception is limited and information is inherently uncertain. The authors propose CoMe-VLA, a vision-language-action framework that, for the first time, formulates active perception as a non-Markovian process driven by information gain and decision branching. By integrating a cognitive auxiliary head with a dual-track memory system, the architecture maintains continuous awareness of both the robot's own state and its environment. Trained on large-scale human egocentric hand-eye coordination data, CoMe-VLA is optimized progressively in three stages within a unified action space. Experiments show that the approach significantly improves task success rates on wheeled humanoid robots in complex, long-horizon manipulation scenarios, with strong robustness and generalization.
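The paper does not release code, but the dual-track memory and cognitive auxiliary head described above can be illustrated with a minimal PyTorch sketch. Every name, dimension, and design choice below (DualTrackMemory, CognitiveHead, the two GRU tracks, the stay/advance transition logits) is a hypothetical reading of the summary, not the authors' implementation.

```python
# Illustrative sketch only -- the paper publishes no code, so every module
# name, dimension, and fusion choice here is an assumption.
import torch
import torch.nn as nn

class DualTrackMemory(nn.Module):
    """Hypothetical dual-track memory: one GRU track over proprioceptive
    history, one over visual features, fused into a shared context."""
    def __init__(self, proprio_dim=32, visual_dim=512, hidden=256):
        super().__init__()
        self.proprio_track = nn.GRU(proprio_dim, hidden, batch_first=True)
        self.visual_track = nn.GRU(visual_dim, hidden, batch_first=True)
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, proprio_seq, visual_seq):
        _, h_p = self.proprio_track(proprio_seq)  # (1, B, hidden)
        _, h_v = self.visual_track(visual_seq)    # (1, B, hidden)
        context = torch.cat([h_p[-1], h_v[-1]], dim=-1)
        return torch.tanh(self.fuse(context))     # fused memory state

class CognitiveHead(nn.Module):
    """Hypothetical auxiliary head: predicts whether to advance to the
    next sub-task, conditioned on the fused memory state."""
    def __init__(self, hidden=256, num_subtasks=8):
        super().__init__()
        self.transition = nn.Linear(hidden, 2)          # stay / advance
        self.subtask = nn.Linear(hidden, num_subtasks)  # current sub-task id

    def forward(self, context):
        return self.transition(context), self.subtask(context)

# Usage: fuse 10 steps of proprioceptive + visual history for a batch of 4.
memory = DualTrackMemory()
head = CognitiveHead()
ctx = memory(torch.randn(4, 10, 32), torch.randn(4, 10, 512))
transition_logits, subtask_logits = head(ctx)
```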

📝 Abstract
Achieving generalizable manipulation in unconstrained environments requires a robot to proactively resolve information uncertainty, i.e., the capability of active perception. However, existing methods are often confined to a limited set of sensing behaviors, restricting their applicability to complex environments. In this work, we formalize active perception as a non-Markovian process driven by information gain and decision branching, providing a structured categorization of visual active perception paradigms. Building on this perspective, we introduce CoMe-VLA, a cognitive and memory-aware vision-language-action (VLA) framework that leverages large-scale human egocentric data to learn versatile exploration and manipulation priors. Our framework integrates a cognitive auxiliary head for autonomous sub-task transitions and a dual-track memory system that maintains consistent self- and environment awareness by fusing proprioceptive and visual temporal contexts. By aligning human and robot hand-eye coordination behaviors in a unified egocentric action space, we train the model progressively in three stages. Extensive experiments on a wheeled humanoid demonstrate the strong robustness and adaptability of our method across diverse long-horizon tasks spanning multiple active perception scenarios.
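To make the non-Markovian, information-gain-driven framing concrete, here is a small self-contained Python sketch of the decision-branching idea: the agent takes a sensing action only when its expected entropy reduction over a history-conditioned belief exceeds a threshold, and otherwise manipulates directly. The belief representation, candidate actions, and threshold are hypothetical illustrations; in the paper this behavior is learned end-to-end, not hand-coded.

```python
# Illustrative sketch of information-gain-driven sensing-action selection.
# Belief model and candidate actions are hypothetical examples.
import math

def entropy(p):
    """Shannon entropy of a discrete belief distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def expected_info_gain(belief, predicted_beliefs):
    """Expected entropy reduction over possible observation outcomes.
    predicted_beliefs: list of (probability, posterior_belief) pairs."""
    prior_h = entropy(belief)
    posterior_h = sum(w * entropy(b) for w, b in predicted_beliefs)
    return prior_h - posterior_h

def select_action(belief, candidates, threshold=0.1):
    """Branch: sense first if some sensing action is informative enough,
    otherwise manipulate directly. Non-Markovian in that `belief` is
    built from the full interaction history, not the current frame."""
    best = max(candidates,
               key=lambda c: expected_info_gain(belief, c["outcomes"]))
    gain = expected_info_gain(belief, best["outcomes"])
    return best["name"] if gain > threshold else "manipulate"

# Toy example: uncertain between two object locations; peeking resolves it.
belief = [0.5, 0.5]
candidates = [
    {"name": "peek_left", "outcomes": [(0.5, [0.9, 0.1]), (0.5, [0.1, 0.9])]},
    {"name": "stay",      "outcomes": [(1.0, [0.5, 0.5])]},
]
print(select_action(belief, candidates))  # -> "peek_left"
```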
Problem

Research questions and friction points this paper is trying to address.

active perception
non-Markovian
information uncertainty
generalizable manipulation
egocentric data
Innovation

Methods, ideas, or system contributions that make the work stand out.

non-Markovian active perception
vision-language-action (VLA)
egocentric human data
dual-track memory system
cognitive auxiliary head
Jialiang Li
School of Artificial Intelligence, Shanghai Jiao Tong University
Yi Qiao
School of Artificial Intelligence, Shanghai Jiao Tong University
Yunhan Guo
School of Artificial Intelligence, Shanghai Jiao Tong University
Changwen Chen
School of Artificial Intelligence, Shanghai Jiao Tong University
Wenzhao Lian
Google X
Robotics, Machine Learning