🤖 AI Summary
Existing motion prediction models are trained on narrow data distributions, which limits their ability to model complex dynamics and to generalize to long-horizon, cross-scenario tasks. To address this, we propose Autoregressive Flow Matching (ARFM), the first framework to extend flow matching to probabilistic modeling of continuous-time sequences for high-fidelity long-term prediction of human and robotic point trajectories. ARFM combines autoregressive temporal modeling with training on video from multiple sources, and its predicted trajectories can be used to condition downstream tasks. Evaluated on a newly constructed human/robot motion prediction benchmark, ARFM achieves significant improvements: +23.6% in L2 trajectory fidelity for long-horizon generation and +8.4% in action classification accuracy. These results indicate that ARFM addresses key bottlenecks in modeling complex dynamics and in cross-scenario generalization.
📝 Abstract
Motion prediction has been studied in different contexts, with models trained on narrow distributions and applied to downstream tasks in human motion prediction and robotics. Meanwhile, recent efforts to scale video prediction have demonstrated impressive visual realism, yet these models struggle to accurately model complex motions despite their massive scale. Inspired by the scaling of video generation, we develop autoregressive flow matching (ARFM), a new method for probabilistic modeling of sequential continuous data, and train it on diverse video datasets to generate future point track locations over long horizons. To evaluate our model, we develop benchmarks that measure how well motion prediction models predict human and robot motion. Our model is able to predict complex motions, and we demonstrate that conditioning robot action prediction and human motion prediction on predicted future tracks can significantly improve downstream task performance. Code and models are publicly available at: https://github.com/Johnathan-Xie/arfm-motion-prediction.
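To make the general idea of autoregressive flow matching concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a linear interpolation path between noise and data, a fixed-step Euler solver, and one 2D point predicted per step. All function names, including `toy_velocity`, are hypothetical; `toy_velocity` stands in for a trained network and simply encodes the exact velocity field for a deterministic "move one unit along x" target.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Training pair for linear-path flow matching (an assumed path choice).

    x0: noise sample, x1: data sample, t: scalar in [0, 1].
    Returns the interpolated point and the velocity a network would regress to.
    """
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def euler_sample(velocity_fn, context, dim, n_steps, rng):
    """Draw one sample by integrating dx/dt = v(x, t, context) from t=0 to t=1."""
    x = rng.standard_normal(dim)  # start from Gaussian noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, context)
    return x

def autoregressive_rollout(velocity_fn, history, horizon, n_steps=8, seed=0):
    """Generate `horizon` future points, feeding each sample back as context."""
    rng = np.random.default_rng(seed)
    track = [np.asarray(p, dtype=float) for p in history]
    for _ in range(horizon):
        context = np.stack(track)  # all points observed or generated so far
        next_point = euler_sample(velocity_fn, context, track[-1].shape[0], n_steps, rng)
        track.append(next_point)
    return np.stack(track)

def toy_velocity(x, t, context):
    """Hypothetical stand-in for a learned network.

    Exact linear-path velocity field when the next point is always
    the last context point shifted by (1, 0).
    """
    target = context[-1] + np.array([1.0, 0.0])
    return (target - x) / (1.0 - t)

if __name__ == "__main__":
    rollout = autoregressive_rollout(toy_velocity, history=[[0.0, 0.0]], horizon=3)
    print(rollout)
```

In a real system the velocity function would be a neural network trained by regressing onto `v_target` from `flow_matching_pair`, and the context would include video features as well as past track locations. Those details are not specified in the text above and are left out of the sketch.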