🤖 AI Summary
This work addresses the challenge of translating high-level human intentions into stable, safe, and adaptive whole-body motion control for human–robot collaboration. The authors propose a three-layer cognitive-control architecture that explicitly couples System 2–style deliberative reasoning with System 1–style rapid reactive control, which they present as a first in this domain. The framework combines vision-language models for grounding, multi-agent reinforcement learning cast as a decentralized Markov potential game for coordination, and whole-body dynamics control for execution, enabling adaptive coordination without predefined roles. A residual policy, learned relative to a nominal controller, internalizes the human partner's dynamics to improve responsiveness. Evaluated on cooperative object-carrying tasks, the system outperforms single-agent and end-to-end baselines in success rate and robustness, and exhibits emergent leader–follower behaviors.
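The Markov potential game framing above means that any agent's payoff change from a unilateral deviation equals the change in one shared potential function, which is what lets decentralized learners optimize a common task-progress objective. A minimal sketch verifying that property on a toy two-agent stage game (the payoff matrix, action labels, and cost weights are illustrative assumptions, not taken from the paper):

```python
import itertools
import numpy as np

# Toy two-agent stage game: both agents share a task-progress payoff and
# pay an individual action cost, yielding an exact potential game.
# Actions: 0 = "lift", 1 = "wait" (labels are assumptions for illustration).
progress = np.array([[1.0, 0.2],
                     [0.2, 0.0]])  # shared task-progress term, indexed by (a0, a1)

def payoff(i, a):
    """Agent i's payoff at joint action a: shared progress minus own cost."""
    return progress[a] - 0.1 * a[i]

def potential(a):
    """Shared potential: progress minus the sum of individual costs."""
    return progress[a] - 0.1 * (a[0] + a[1])

# Potential-game property: for every unilateral deviation by any agent,
# that agent's payoff change equals the potential change.
for i in (0, 1):
    for a in itertools.product((0, 1), repeat=2):
        for b in (0, 1):
            a_dev = list(a)
            a_dev[i] = b
            a_dev = tuple(a_dev)
            assert np.isclose(payoff(i, a_dev) - payoff(i, a),
                              potential(a_dev) - potential(a))
```

Because every deviation's effect on the potential matches its effect on the deviator's payoff, agents independently improving their own payoffs jointly ascend the shared potential, which here stands in for task progress.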
📝 Abstract
Effective human-robot collaboration (HRC) requires translating high-level intent into contact-stable whole-body motion while continuously adapting to a human partner. Many vision-language-action (VLA) systems learn end-to-end mappings from observations and instructions to actions, but they often emphasize reactive (System 1-like) behavior and leave under-specified how sustained System 2-style deliberation can be integrated with reliable, low-latency continuous control. This gap is acute in multi-agent HRC, where long-horizon coordination decisions and physical execution must co-evolve under contact, feasibility, and safety constraints. We address this limitation with cognition-to-control (C2C), a three-layer hierarchy that makes the deliberation-to-control pathway explicit: (i) a vision-language-model (VLM)-based grounding layer that maintains persistent scene referents and infers embodiment-aware affordances and constraints; (ii) a deliberative skill/coordination layer (the System 2 core) that optimizes long-horizon skill choices and sequences under human-robot coupling via decentralized multi-agent reinforcement learning (MARL), cast as a Markov potential game with a shared potential encoding task progress; and (iii) a whole-body control layer that executes the selected skills at high frequency while enforcing kinematic/dynamic feasibility and contact stability. The deliberative layer is realized as a residual policy relative to a nominal controller, internalizing partner dynamics without explicit role assignment. Experiments on collaborative manipulation tasks show higher success and robustness than single-agent and end-to-end baselines, with stable coordination and emergent leader-follower behaviors.
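The abstract describes the deliberative layer as a residual policy learned on top of a nominal controller. A minimal one-dimensional sketch of that structure, where a PD tracking law stands in for the nominal controller and a bounded learned correction conditions on partner observations (the PD gains, linear residual, and clipping bound are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def nominal_controller(state, target, kp=2.0, kd=0.5):
    """PD tracking toward a target; a stand-in for the nominal whole-body controller."""
    pos, vel = state
    return kp * (target - pos) - kd * vel

def residual_policy(state, partner_obs, weights):
    """Learned correction conditioned on own state and partner observations.
    A linear map is used here as a placeholder for the trained policy network."""
    features = np.concatenate([state, partner_obs])
    return weights @ features

def act(state, target, partner_obs, weights, residual_scale=0.3):
    """Final command = nominal action + bounded learned residual.
    Clipping keeps the residual a small correction, so the nominal
    controller's behavior is preserved when the residual is uninformative."""
    u_nom = nominal_controller(state, target)
    u_res = np.clip(residual_policy(state, partner_obs, weights), -1.0, 1.0)
    return u_nom + residual_scale * u_res
```

With zero residual weights the command reduces exactly to the nominal controller, and the clipped, scaled residual bounds how far the learned correction can pull the command away from it; this is one common way such residual architectures keep learning from destabilizing a feasible baseline.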