🤖 AI Summary
This work addresses the limitations of existing perceptive humanoid locomotion controllers, which rely on explicit geometric intermediate representations or geometry-related auxiliary targets and therefore struggle with vertical structures, perforated obstacles, and cluttered real-world terrain. The authors propose CReF (Cross-modal and Recurrent Fusion), a single-stage, depth-conditioned humanoid locomotion framework that learns locomotion-relevant features directly from raw forward-facing depth images. CReF couples proprioception and depth tokens through proprioception-queried cross-modal attention, fuses the result with a gated residual fusion block, and integrates it over time with a GRU whose output is regulated by a highway-style gate that blends recurrent and feedforward features. A terrain-aware foothold placement reward further improves terrain interaction by rewarding touchdown locations that lie close to the nearest supportable foothold candidate. Experiments in simulation and on a physical humanoid demonstrate robust traversal over diverse terrains and zero-shot transfer to real-world scenes containing handrails, hollow pallet assemblies, severe reflective interference, and visually cluttered outdoor surroundings.
📝 Abstract
Stable traversal over geometrically complex terrain increasingly requires exteroceptive perception, yet prior perceptive humanoid locomotion methods often remain tied to explicit geometric abstractions, either by mediating control through robot-centric 2.5D terrain representations or by shaping depth learning with auxiliary geometry-related targets. Such designs inherit the representational bias of the intermediate or supervisory target and can be restrictive for vertical structures, perforated obstacles, and complex real-world clutter. We propose CReF (Cross-modal and Recurrent Fusion), a single-stage depth-conditioned humanoid locomotion framework that learns locomotion-relevant features directly from raw forward-facing depth without explicit geometric intermediates. CReF couples proprioception and depth tokens through proprioception-queried cross-modal attention, fuses the resulting representation with a gated residual fusion block, and performs temporal integration with a Gated Recurrent Unit (GRU) regulated by a highway-style output gate for state-dependent blending of recurrent and feedforward features. To further improve terrain interaction, we introduce a terrain-aware foothold placement reward that extracts supportable foothold candidates from foot-end point-cloud samples and rewards touchdown locations that lie close to the nearest supportable candidate. Experiments in simulation and on a physical humanoid demonstrate robust traversal over diverse terrains and effective zero-shot transfer to real-world scenes containing handrails, hollow pallet assemblies, severe reflective interference, and visually cluttered outdoor surroundings.
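The abstract names two mechanisms without giving their formulas: the highway-style output gate and the foothold placement reward. The sketch below is one plausible reading of each, not the authors' implementation. Everything not stated in the abstract is an assumption made for illustration: the function names `highway_blend` and `foothold_reward`, the gate computed from the concatenated recurrent and feedforward features, the Gaussian distance kernel, the `sigma` value, and the use of 2D candidate positions.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def highway_blend(h_rec, h_ff, w_gate, b_gate):
    """Blend recurrent and feedforward features with a state-dependent gate.

    Assumed form (not taken from the paper):
        g   = sigmoid(W [h_rec; h_ff] + b)
        out = g * h_rec + (1 - g) * h_ff
    """
    g = sigmoid(w_gate @ np.concatenate([h_rec, h_ff]) + b_gate)
    return g * h_rec + (1.0 - g) * h_ff


def foothold_reward(touchdown_xy, candidates_xy, sigma=0.05):
    """Reward that decays with distance to the nearest supportable candidate.

    Assumed form: a Gaussian kernel on the nearest-candidate distance, so the
    reward is 1.0 on a candidate and approaches 0.0 far from all of them.
    Returns 0.0 when no supportable candidate exists.
    """
    candidates_xy = np.asarray(candidates_xy, dtype=float)
    if candidates_xy.size == 0:
        return 0.0
    offsets = candidates_xy - np.asarray(touchdown_xy, dtype=float)
    nearest = np.linalg.norm(offsets, axis=1).min()
    return float(np.exp(-(nearest ** 2) / (2.0 * sigma ** 2)))
```

With a zero gate weight and bias, the gate is 0.5 everywhere and the output is the plain average of the two feature vectors; a large positive bias drives the output toward the recurrent features. In a training setup the gate parameters would be learned, and the candidate set would come from the foot-end point-cloud samples the abstract describes.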