DexFuture: Hierarchical Future-State Visuomotor Targeting for Bimanual Dexterous Tool Use

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses dexterous tool manipulation with bimanual robot hands, where hand configurations are high-dimensional and hand-tool-object dynamics are complex. The authors propose a hierarchical architecture. At the high level, a predictor builds structured hand-tool-object visuomotor embeddings from egocentric RGB images, proprioception, and geometric history, and a horizon-conditioned Transformer generates a multi-step future target trajectory. At the low level, a target-conditioned per-link Transformer policy tracks that trajectory with high-frequency control. The approach generates dynamically consistent reference trajectories without relying on privileged states from demonstrations or slow counterfactual planning. On OakInk2 bimanual tool-use tasks, it achieves 90% of privileged-oracle performance, compared with 7% for a no-reference policy, and runs at 60 Hz, approximately 250 times faster than DexWM-style CEM planning with a future action-conditioned world model.

📝 Abstract

Bimanual dexterous tool use remains challenging for robots due to high-dimensional hand configurations and complex hand-tool-object dynamics and contact. Most existing control policies depend on future configuration references provided from demonstrations, while future action-conditioned world models require slow online planning over high-dimensional action sequences. A significant challenge is generating a dynamically consistent future reference trajectory without relying on privileged states from demonstrations or slow counterfactual planning. We propose DexFuture, a hierarchical system that couples a high-level Future-State Visuomotor Target Predictor with a low-level Target-Conditioned Structured Dexterous Policy. Conditioned on egocentric RGB, proprioceptive, and geometric history, the high-level predictor constructs structured hand-tool-object visuomotor embeddings and uses a horizon-conditioned transformer to generate a multi-step future target trajectory. The low-level policy then tracks this trajectory with a target-conditioned per-link transformer. This hierarchy decouples coarse future reference generation from fine-grained action control, and slow long-horizon semantic prediction from high-frequency execution. On OakInk2 bimanual tool-use tasks, DexFuture achieves 90% of the privileged-oracle performance, compared to 7% for a no-reference policy. DexFuture operates at 60 Hz, approximately 250 times faster than DexWM-style Cross-Entropy Method (CEM) planning with a future action-conditioned world model.
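
The decoupling described in the abstract can be illustrated with a toy two-rate control loop. This is a minimal sketch and not the authors' implementation: the function names, the dimensions, the replanning interval, and the toy dynamics are all hypothetical, and both learned transformers are replaced with simple hand-written stand-ins (linear interpolation for the predictor, a proportional step for the policy). It only shows how a slower high-level predictor can hand a multi-step target trajectory to a low-level tracker that runs at every control step.

```python
import numpy as np

STATE_DIM = 8      # hypothetical size of the tracked configuration vector
HORIZON = 4        # hypothetical number of future target steps per prediction
REPLAN_EVERY = 4   # hypothetical number of low-level steps per prediction


def predict_targets(history, goal):
    """Stand-in for the high-level predictor (assumption, not the paper's model).

    Returns HORIZON future targets by interpolating from the latest state
    toward a fixed goal. The real predictor is a learned transformer.
    """
    current = history[-1]
    fractions = np.arange(1, HORIZON + 1) / HORIZON           # shape (HORIZON,)
    return current + fractions[:, None] * (goal - current)    # (HORIZON, STATE_DIM)


def track_target(state, target, gain=0.5):
    """Stand-in for the low-level policy: a proportional step toward the target."""
    return gain * (target - state)


def run_episode(num_steps=40):
    """Run the two-rate loop and return the final state and predictor call count."""
    goal = np.ones(STATE_DIM)
    state = np.zeros(STATE_DIM)
    history = [state.copy()]
    targets = None
    high_level_calls = 0
    for step in range(num_steps):
        phase = step % REPLAN_EVERY
        if phase == 0:
            # Slow path: refresh the multi-step target trajectory.
            targets = predict_targets(np.stack(history), goal)
            high_level_calls += 1
        # Fast path: every control step tracks the target for the current phase.
        action = track_target(state, targets[phase])
        state = state + action   # toy dynamics: the action is a state delta
        history.append(state.copy())
    return state, high_level_calls


if __name__ == "__main__":
    final_state, calls = run_episode()
    print("high-level calls:", calls)
    print("max distance to goal:", float(np.abs(final_state - 1.0).max()))
```

In this sketch the predictor is queried once per `REPLAN_EVERY` control steps while the tracker acts on every step, which mirrors the abstract's separation of slow long-horizon prediction from high-frequency execution. The actual update rates of the two DexFuture levels are not given in this summary.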

Problem

Research questions and friction points this paper is trying to address.

bimanual dexterous tool use

future reference trajectory

dynamically consistent

high-dimensional action sequences

visuomotor targeting

Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical control

future-state prediction

visuomotor targeting