CReF: Cross-modal and Recurrent Fusion for Depth-conditioned Humanoid Locomotion

📅 2026-03-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing perceptive humanoid locomotion controllers: they rely on explicit geometric intermediate representations or geometry-related auxiliary targets, and therefore struggle with vertical structures, perforated obstacles, and real-world clutter. The authors propose CReF (Cross-modal and Recurrent Fusion), a single-stage, depth-conditioned humanoid locomotion framework that learns locomotion-relevant features directly from raw forward-facing depth images. CReF couples proprioception and depth tokens through proprioception-queried cross-modal attention, fuses the result with a gated residual fusion block, and integrates over time with a GRU whose output is regulated by a highway-style gate that blends recurrent and feedforward features. A terrain-aware foothold placement reward further improves terrain interaction by rewarding touchdowns that land close to the nearest supportable foothold candidate. Experiments in simulation and on a physical humanoid show robust traversal over diverse terrains and zero-shot transfer to real-world scenes containing handrails, hollow pallet assemblies, severe reflective interference, and visually cluttered outdoor surroundings.
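To make the fusion idea concrete, here is a minimal pure-Python sketch of two ingredients named in the abstract: a single proprioception query attending over depth tokens, and a highway-style gate blending recurrent and feedforward features. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cross_attend`, `highway_blend`), the single-head scaled dot-product form, the absence of learned projections, and the convention that the gate weights the recurrent branch are all hypothetical choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention with one query.

    query  : proprioception embedding (assumed already projected).
    keys   : one key vector per depth token (assumed already projected).
    values : one value vector per depth token.
    Returns the attended feature vector and the attention weights.
    """
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    attended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return attended, weights


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def highway_blend(gate_logits, recurrent, feedforward):
    """Element-wise blend g * recurrent + (1 - g) * feedforward.

    Assumption: the gate g = sigmoid(logit) weights the recurrent branch;
    the paper may use the opposite convention.
    """
    return [
        sigmoid(g) * r + (1.0 - sigmoid(g)) * f
        for g, r, f in zip(gate_logits, recurrent, feedforward)
    ]


# Toy usage: two depth tokens, the query is aligned with the first one.
attended, weights = cross_attend(
    query=[1.0, 0.0],
    keys=[[4.0, 0.0], [0.0, 4.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
blended = highway_blend([0.0, 0.0], recurrent=[2.0, 4.0], feedforward=[0.0, 0.0])
```

In the toy call the query is aligned with the first depth token, so that token receives the larger attention weight. With zero gate logits the blend is an even mix of the two branches.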

📝 Abstract

Stable traversal over geometrically complex terrain increasingly requires exteroceptive perception, yet prior perceptive humanoid locomotion methods often remain tied to explicit geometric abstractions, either by mediating control through robot-centric 2.5D terrain representations or by shaping depth learning with auxiliary geometry-related targets. Such designs inherit the representational bias of the intermediate or supervisory target and can be restrictive for vertical structures, perforated obstacles, and complex real-world clutter. We propose CReF (Cross-modal and Recurrent Fusion), a single-stage depth-conditioned humanoid locomotion framework that learns locomotion-relevant features directly from raw forward-facing depth without explicit geometric intermediates. CReF couples proprioception and depth tokens through proprioception-queried cross-modal attention, fuses the resulting representation with a gated residual fusion block, and performs temporal integration with a Gated Recurrent Unit (GRU) regulated by a highway-style output gate for state-dependent blending of recurrent and feedforward features. To further improve terrain interaction, we introduce a terrain-aware foothold placement reward that extracts supportable foothold candidates from foot-end point-cloud samples and rewards touchdown locations that lie close to the nearest supportable candidate. Experiments in simulation and on a physical humanoid demonstrate robust traversal over diverse terrains and effective zero-shot transfer to real-world scenes containing handrails, hollow pallet assemblies, severe reflective interference, and visually cluttered outdoor surroundings.
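The foothold reward described above can be sketched as a function of the distance from a touchdown point to its nearest supportable candidate. This is a hedged illustration only: the abstract does not give the shaping function, so the Gaussian kernel, the `sigma` length scale, the 2D coordinates, and the name `foothold_reward` are assumptions, and the extraction of candidates from foot-end point-cloud samples is taken as already done.

```python
import math


def foothold_reward(touchdown_xy, candidates_xy, sigma=0.1):
    """Reward a touchdown for landing near a supportable foothold candidate.

    touchdown_xy  : (x, y) of the foot at contact, in metres.
    candidates_xy : list of (x, y) supportable candidates (assumed given).
    sigma         : hypothetical length scale of the Gaussian kernel.
    Returns a value in [0, 1]; 1.0 means the foot landed exactly on a candidate.
    """
    if not candidates_xy:
        # No supportable candidate in view: give no shaping signal.
        return 0.0
    nearest = min(math.dist(touchdown_xy, c) for c in candidates_xy)
    return math.exp(-((nearest / sigma) ** 2))


# Toy usage: one touchdown on a candidate, one 5 cm away, one 30 cm away.
candidates = [(0.0, 0.0), (0.4, 0.0)]
on_target = foothold_reward((0.0, 0.0), candidates)
slightly_off = foothold_reward((0.05, 0.0), candidates)
far_off = foothold_reward((0.2, 0.3), candidates)
```

Because only the nearest candidate matters, the reward stays high anywhere near a supportable point and decays smoothly as the foot drifts toward unsupported regions such as gaps in a perforated surface.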

Problem

Research questions and friction points this paper is trying to address.

humanoid locomotion

depth perception

geometric abstraction

terrain traversal

exteroceptive perception

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal fusion

Depth-conditioned locomotion

Gated Recurrent Unit

Terrain-aware foothold placement

Zero-shot transfer

🔎 Similar Papers

AdaptiveFusion: Adaptive Multi-Modal Multi-View Fusion for 3D Human Body Reconstruction

2024-09-07, arXiv.org, Citations: 1