DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Synchronous multimodal processing is poorly matched to physical interaction, where each modality updates at its own rate: slow modalities are oversampled, fast ones are undersampled, and action generation is capped at the lowest effective frequency. This work proposes DAM-VLA, a decoupled asynchronous Vision-Language-Action (VLA) model that maintains an independent latent buffer for each modality, each refreshed at its native sensor rate and read continuously by the action head. New high-frequency modalities are fused through gated cross-attention, which leaves the pretrained backbone intact. Evaluated on seven real-world, contact-rich manipulation tasks, the method achieves an average success rate of 95.2%, more than double that of the strongest synchronous baseline (40.95%), while sustaining smooth, reactive control at 100 Hz.
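The per-modality buffering idea can be illustrated with a small simulation. This is a minimal sketch and not the authors' implementation: the `LatentBuffer` class, the `run_episode` function, the 1 kHz simulation clock, and the example rates (500 Hz for a fast modality, 10 Hz for vision, language written once) are all hypothetical choices made for illustration. Only the 100 Hz control rate comes from the summary.

```python
from dataclasses import dataclass


@dataclass
class LatentBuffer:
    """Holds the most recent latent for one modality (hypothetical helper)."""
    rate_hz: int            # assumed native sensor rate; 0 means "written once per episode"
    latest: object = None   # most recent latent, read by the action head
    updates: int = 0        # number of writes, tracked for inspection

    def write(self, latent):
        self.latest = latent
        self.updates += 1


def run_episode(duration_s=1.0, control_hz=100, rates=None, base_hz=1000):
    """Simulate buffers refreshed at their own rates and read at control_hz.

    The rates below are illustrative assumptions, not values from the paper.
    """
    rates = rates or {"force": 500, "vision": 10, "language": 0}
    buffers = {name: LatentBuffer(rate) for name, rate in rates.items()}
    action_reads = 0

    for tick in range(int(duration_s * base_hz)):
        # Each modality writes only when its own sensor period elapses.
        for name, buf in buffers.items():
            if buf.rate_hz == 0:
                if tick == 0:
                    buf.write((name, tick))
            elif tick % (base_hz // buf.rate_hz) == 0:
                buf.write((name, tick))

        # The action head reads whatever is currently stored, without
        # waiting for any modality to produce a fresh sample.
        if tick % (base_hz // control_hz) == 0:
            snapshot = {name: buf.latest for name, buf in buffers.items()}
            action_reads += 1

    return buffers, action_reads


buffers, action_reads = run_episode()
for name, buf in buffers.items():
    print(name, buf.updates)
print("action reads:", action_reads)
```

In this sketch the control loop never blocks on the slowest sensor: it reuses the last vision latent between camera frames while still seeing the newest value from the fast modality.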

📝 Abstract

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers refreshed at sensor rates and read continuously by the action head, integrating new high-frequency modalities through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining smooth, reactive 100 Hz control. Project website: https://intuitive-robots.github.io/DAM-VLA/
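To make the gated cross-attention step concrete, the following is a minimal NumPy sketch under stated assumptions. The abstract does not specify the gate's form; a scalar tanh gate initialized at zero is assumed here because it is a common way to add a new input stream without disturbing a pretrained model at initialization. The function name `gated_cross_attention`, the weight matrices, and all shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(h, m, Wq, Wk, Wv, gate):
    """Fuse new-modality latents m into backbone tokens h.

    h: (T, d) backbone hidden states, used as queries.
    m: (S, d) latents from the new modality's buffer, used as keys and values.
    gate: scalar; tanh(gate) scales the residual update (assumed gate form).
    """
    q = h @ Wq
    k = m @ Wk
    v = m @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, S) scaled dot products
    attn = softmax(scores, axis=-1)           # each row sums to 1
    return h + np.tanh(gate) * (attn @ v)     # gated residual connection


rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=(4, d))   # 4 backbone tokens
m = rng.normal(size=(6, d))   # 6 buffered latents from the new modality
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

out_closed = gated_cross_attention(h, m, Wq, Wk, Wv, gate=0.0)
out_open = gated_cross_attention(h, m, Wq, Wk, Wv, gate=1.0)
```

With the gate at zero, `out_closed` equals `h`, so the backbone's behavior is unchanged until the gate is trained away from zero. This is one way a new modality could be integrated while the pretrained weights stay untouched.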

Problem

Research questions and friction points this paper is trying to address.

vision-language-action

asynchronous processing

multimodal perception

sensor rates

temporal decoupling

Innovation

Methods, ideas, or system contributions that make the work stand out.

asynchronous multimodal processing

decoupled temporal modeling

vision-language-action models