See Selectively, Act Adaptively: Dual-Level Structural Decomposition for Bimanual Robot Manipulation

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two difficulties in bimanual robotic manipulation that complicate policy learning: task-relevant visual information shifts across task stages, and the two arms alternate between independent and coordinated interaction modes. The authors propose a two-level vision-language-action (VLA) framework that combines a View-Selective Visual Router, which adjusts wrist-view contributions, with an Interaction-Aware Mixture-of-Experts (MoE) action policy, which splits action generation into coordinated and arm-wise pathways. Treating visual relevance and bimanual interaction structure as separate concerns provides an effective inductive bias. On six simulated tasks in RoboTwin 2.0 and three long-horizon real-world tasks, the method improves the overall average success rate over a monolithic baseline by 27.7% and 43.3%, respectively, and consistently outperforms single-module variants.

📝 Abstract

In bimanual robotic manipulation, task-relevant visual information varies with the task stage and context, while the interaction of the two arms shifts between independent and coordinated modes, making policy learning challenging. However, existing monolithic Vision-Language-Action (VLA) policies process diverse visual inputs and interaction patterns through a single shared representation and action generation pathway, often failing to separately account for visual relevance and bimanual interaction structure. To address this issue, we propose a bimanual manipulation VLA framework based on Dual-Level Structural Decomposition. The View-Selective Visual Router dynamically adjusts wrist-view contributions to emphasize relevant visual cues, while the Interaction-Aware Action Mixture-of-Experts (MoE) decomposes action generation into coordinated and arm-wise pathways to adapt to varying bimanual interaction modes. We evaluate the proposed method on six simulated bimanual manipulation tasks in RoboTwin 2.0 and three long-horizon real-world tasks. Our model improves the overall average success rate over a monolithic baseline by 27.7% in simulation and 43.3% in real-world evaluation, while consistently outperforming single-module variants across both settings. These results demonstrate that jointly considering selective visual processing and explicit decomposition of bimanual interaction structures provides an effective inductive bias for robust bimanual manipulation.
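The abstract says the View-Selective Visual Router "dynamically adjusts wrist-view contributions" but does not say how the weights are computed. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes each camera view has already been encoded as a flat feature vector, that a sigmoid gate per wrist view is computed from the head-view feature, and that the gated features are concatenated. The function name `route_views` and the parameters `gate_weights` and `gate_bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def route_views(head_feat, left_wrist_feat, right_wrist_feat,
                gate_weights, gate_bias):
    """Scale each wrist-view feature by a gate in (0, 1).

    Assumption: the gate for each wrist view is a sigmoid of a linear
    score on the head-view feature. gate_weights holds one weight row
    per wrist view (left, then right); gate_bias holds one bias each.
    """
    gates = []
    for row, bias in zip(gate_weights, gate_bias):
        score = sum(w * h for w, h in zip(row, head_feat)) + bias
        gates.append(sigmoid(score))

    g_left, g_right = gates
    routed_left = [g_left * v for v in left_wrist_feat]
    routed_right = [g_right * v for v in right_wrist_feat]

    # List concatenation: head features pass through unchanged,
    # followed by the down- or up-weighted wrist features.
    fused = list(head_feat) + routed_left + routed_right
    return gates, fused


if __name__ == "__main__":
    gates, fused = route_views(
        head_feat=[1.0, 2.0],
        left_wrist_feat=[4.0],
        right_wrist_feat=[6.0],
        gate_weights=[[0.0, 0.0], [0.0, 0.0]],  # untrained: neutral gates
        gate_bias=[0.0, 0.0],
    )
    print(gates, fused)
```

With all-zero gate parameters both gates sit at 0.5, so each wrist view contributes at half strength. In a trained model the gate parameters would be learned so that, for example, a wrist view is emphasized while its gripper is near the object.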

Problem

Research questions and friction points this paper is trying to address.

bimanual manipulation

visual relevance

interaction structure

Vision-Language-Action policy

policy learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-Level Structural Decomposition

View-Selective Visual Router

Interaction-Aware MoE
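To make the Interaction-Aware MoE idea concrete, here is a minimal sketch of an action head that mixes a coordinated pathway with arm-wise pathways. It is an illustration under stated assumptions, not the paper's architecture: each expert is a single linear map, a softmax gate picks between two interaction modes (coordinated, independent), and the final action is the gate-weighted blend. The function `moe_action` and the keys of `params` are hypothetical names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def moe_action(feat, params):
    """Blend a coordinated expert with two arm-wise experts.

    Assumptions: params["gate"] has two rows (coordinated, independent);
    params["coordinated"] outputs the full two-arm action (left arm
    first, then right arm); params["left"] and params["right"] each
    output the action of one arm only.
    """
    p_coord, p_indep = softmax(linear(params["gate"], feat))

    coordinated = linear(params["coordinated"], feat)
    # Arm-wise pathway: each arm is predicted by its own expert and
    # the two outputs are concatenated in the same left-right order.
    armwise = linear(params["left"], feat) + linear(params["right"], feat)

    action = [p_coord * c + p_indep * a
              for c, a in zip(coordinated, armwise)]
    return [p_coord, p_indep], action


if __name__ == "__main__":
    params = {
        "gate": [[0.0], [0.0]],          # untrained: equal mode weights
        "coordinated": [[2.0], [2.0]],   # one action dim per arm
        "left": [[0.0]],
        "right": [[4.0]],
    }
    print(moe_action([1.0], params))
```

A soft blend is used here only for simplicity; a hard top-1 choice between modes would be an equally reasonable reading of the summary.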