🤖 AI Summary
Existing motion prediction and planning methods for autonomous driving predominantly adopt a one-to-one query-trajectory paradigm, which struggles to model complex spatiotemporal evolution accurately and often leads to collisions or suboptimal decisions. To address this, the authors propose DeMo++, a motion-decoupled framework that splits prediction into two synergistic components: (i) holistic motion intentions, capturing multimodal directional preferences, and (ii) fine-grained spatiotemporal states, modeling the dynamic evolution of each trajectory. A cross-scene trajectory interaction mechanism further models relationships between motions in adjacent scenes. The architecture combines Attention and Mamba modules to balance efficient scene-information aggregation with precise trajectory state-sequence modeling, and supports self-refining trajectory optimization. Evaluated on four major benchmarks (Argoverse 2, nuScenes, nuPlan, and NAVSIM), DeMo++ achieves state-of-the-art performance across motion forecasting, motion planning, and end-to-end driving tasks.
📝 Abstract
Motion forecasting and planning are tasked with estimating the trajectories of traffic agents and the ego vehicle, respectively, to ensure the safety and efficiency of autonomous driving systems in dynamically changing environments. State-of-the-art methods typically adopt a one-query-one-trajectory paradigm, where each query corresponds to a unique trajectory when predicting multi-mode trajectories. While this paradigm can produce diverse motion intentions, it often falls short in modeling the intricate spatiotemporal evolution of trajectories, which can lead to collisions or suboptimal outcomes. To overcome this limitation, we propose DeMo++, a framework that decouples motion estimation into two distinct components: holistic motion intentions, which capture the diverse potential directions of movement, and fine spatiotemporal states, which track the agent's dynamic progress within the scene and enable a self-refinement capability. Further, we introduce a cross-scene trajectory interaction mechanism to explore the relationships between motions in adjacent scenes. This allows DeMo++ to comprehensively model both the diversity of motion intentions and the spatiotemporal evolution of each trajectory. To effectively implement this framework, we develop a hybrid model combining Attention and Mamba. This architecture leverages the strengths of both mechanisms: efficient scene-information aggregation and precise trajectory state-sequence modeling. Extensive experiments demonstrate that DeMo++ achieves state-of-the-art performance across various benchmarks, including motion forecasting (Argoverse 2 and nuScenes), motion planning (nuPlan), and end-to-end planning (NAVSIM).
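To make the decoupling idea concrete, here is a minimal sketch (not the authors' code) of how separate intention and state queries might be combined into per-mode, per-timestep trajectory queries. The function name, dimensions, and the simple additive combination are illustrative assumptions; the paper's actual query interaction is learned inside its Attention/Mamba decoder.

```python
# Hedged sketch of the decoupled-query idea described in the abstract:
# K "motion intention" queries (one per directional mode) and
# T "spatiotemporal state" queries (one per future timestep) are
# combined into a K x T grid of trajectory queries. All names and
# dimensions here are illustrative assumptions, not the paper's API.

def combine_queries(intention_queries, state_queries):
    """Add each intention embedding to every state embedding,
    yielding one query vector per (mode, timestep) pair."""
    return [
        [[iq + sq for iq, sq in zip(intent, state)]
         for state in state_queries]
        for intent in intention_queries
    ]

K, T, D = 6, 8, 4  # modes, timesteps, embedding dim (illustrative)
intention_queries = [[float(k)] * D for k in range(K)]   # K x D
state_queries = [[0.1 * t] * D for t in range(T)]        # T x D

trajectory_queries = combine_queries(intention_queries, state_queries)
# K x T x D grid; a decoder head would map each query to a waypoint.
```

In the one-query-one-trajectory paradigm, a single query must encode both *which way* an agent goes and *how its state evolves over time*; factorizing the two lets each set of queries specialize, which is the core design the abstract describes.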