🤖 AI Summary
This work addresses a limitation of existing vision-based world action models in contact-rich robotic manipulation, where the absence of tactile feedback constrains performance. The authors propose Dream-Tac, a unified visuo-tactile world action model that jointly predicts actions, future visual observations, and tactile dynamics, using contact-gated visuotactile fusion and a contact-aware attention bias. To support real-time deployment, they add a dual-level acceleration strategy: a reformulated contact-aware bias that preserves the fused attention path during training, and cache-based diffusion acceleration at inference. Across six contact-rich manipulation tasks, the method improves action accuracy by 31.7% on average, with up to 2.9× faster training and 1.8× faster inference.
📝 Abstract
World action models inherit the predictive capability of world models, enabling action generation to be guided by anticipated future observations. However, they rely primarily on vision and often fail in contact-rich manipulation, where critical cues arise from physical interaction. In this paper, we propose Dream-Tac, a unified Tactile-World Action Model that jointly models actions, future visual observations, and tactile dynamics. Specifically, Dream-Tac introduces (i) contact-gated visuotactile fusion to selectively integrate tactile signals and (ii) a contact-aware attention bias to better regulate cross-modal interactions during manipulation. To support real-time deployment, we further design a dual-level acceleration strategy, reformulating the contact-aware bias to preserve the fused attention path during training and introducing cache-based diffusion acceleration at inference, achieving up to 2.9× faster training and 1.8× faster inference. Across six contact-rich manipulation tasks, Dream-Tac improves action accuracy by 31.7% on average, demonstrating the effectiveness of unified visuotactile world modeling. Code is available at https://github.com/LYFCLOUDFAN/Dream-Tac.
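
To make the two mechanisms named in the abstract more concrete, the following is a minimal, purely illustrative sketch of what a contact gate and an additive attention bias could look like. It is not taken from the Dream-Tac code base. Every name here (`contact_gate`, `gated_fusion`, `biased_attention`), the sigmoid gate with its `threshold` and `sharpness` parameters, and the choice of `log(gate)` as the bias on the tactile token are assumptions made for illustration only.

```python
import math


def contact_gate(contact_signal, threshold=0.5, sharpness=10.0):
    """Hypothetical soft gate in (0, 1); assumes contact_signal lies in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-sharpness * (contact_signal - threshold)))


def gated_fusion(visual, tactile, gate):
    """Add gate-scaled tactile features to visual features, element-wise."""
    return [v + gate * t for v, t in zip(visual, tactile)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(query, keys, values, bias):
    """Scaled dot-product attention for one query, with an additive bias per key."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [
        scale * sum(q * k for q, k in zip(query, key)) + b
        for key, b in zip(keys, bias)
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


if __name__ == "__main__":
    # Two tokens: index 0 is a visual token, index 1 is a tactile token.
    query = [1.0, 0.0]
    keys = [[1.0, 0.0], [1.0, 0.0]]
    values = [[1.0, 0.0], [0.0, 1.0]]
    for contact in (0.0, 1.0):
        g = contact_gate(contact)
        # Assumed bias: the tactile token is down-weighted when the gate is small.
        bias = [0.0, math.log(g)]
        _, weights = biased_attention(query, keys, values, bias)
        fused = gated_fusion([1.0, 2.0], [0.5, 0.5], g)
        print(contact, round(g, 4), [round(w, 4) for w in weights], fused)
```

In this sketch the gate plays both roles: it scales how much tactile signal enters the fused feature, and its logarithm lowers the attention score of the tactile token when no contact is detected. Whether the actual model couples the two in this way is not stated in the abstract.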