AI Summary
This paper addresses the multimodal modeling challenge in human trajectory prediction by proposing an efficient one-step generative approach. Methodologically, it introduces: (1) a novel flow-matching loss that jointly optimizes single-sample accuracy and multi-sample diversity; (2) a novel knowledge distillation framework for flow models based on implicit maximum likelihood estimation (IMLE), which requires only samples from the teacher, with no explicit density evaluation; and (3) a conditional flow-matching architecture that jointly models historical trajectories and scene context, generating K physically plausible and socially compliant future trajectories in a single forward pass. Evaluated on the SportVU, ETH-UCY, and SDD benchmarks, the method achieves state-of-the-art performance. Notably, the distilled one-step student model samples 100× faster than the teacher while still achieving state-of-the-art performance.
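To make the "teacher samples only" idea in item (2) concrete, here is a minimal NumPy sketch of a generic IMLE-style matching objective: each teacher sample is paired with its nearest student sample, and the mean squared distance of those pairs is the quantity to minimize. This is an illustration under assumptions, not the paper's implementation; the function name `imle_distill_loss`, the flattened `(N, D)` and `(M, D)` array shapes, and the squared Euclidean distance are all hypothetical choices.

```python
import numpy as np

def imle_distill_loss(teacher_samples, student_samples):
    """Generic IMLE-style objective (illustrative, not the paper's exact loss).

    teacher_samples: (N, D) array of samples drawn from the teacher model.
    student_samples: (M, D) array of samples drawn from the student model.
    Returns the mean squared distance from each teacher sample to its
    nearest student sample, plus the index of that nearest student sample.
    """
    t = np.asarray(teacher_samples, dtype=float)
    s = np.asarray(student_samples, dtype=float)
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = ((t[:, None, :] - s[None, :, :]) ** 2).sum(axis=-1)
    # For every teacher sample, select the closest student sample.
    nearest = d2.argmin(axis=1)
    loss = d2[np.arange(len(t)), nearest].mean()
    return float(loss), nearest

# Toy usage: two teacher samples, three student samples in 2-D.
teacher = np.array([[0.0, 0.0], [1.0, 1.0]])
student = np.array([[0.0, 0.0], [1.0, 2.0], [5.0, 5.0]])
loss, nearest = imle_distill_loss(teacher, student)
```

Note that the objective only ever touches samples: no teacher density or likelihood is evaluated, which is the property the summary highlights. In a real training loop the same selection would be done in an autodiff framework so that gradients flow into the student.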
Abstract
In this paper, we address the problem of human trajectory forecasting, which aims to predict the inherently multi-modal future movements of humans based on their past trajectories and other contextual cues. We propose a novel motion prediction conditional flow matching model, termed MoFlow, to predict $K$-shot future trajectories for all agents in a given scene. We design a novel flow matching loss function that not only ensures at least one of the $K$ sets of future trajectories is accurate but also encourages all $K$ sets of future trajectories to be diverse and plausible. Furthermore, by leveraging implicit maximum likelihood estimation (IMLE), we propose a novel distillation method for flow models that only requires samples from the teacher model. Extensive experiments on real-world datasets, including SportVU NBA games, ETH-UCY, and SDD, demonstrate that both our teacher flow model and the IMLE-distilled student model achieve state-of-the-art performance. These models can generate diverse trajectories that are physically and socially plausible. Moreover, our one-step student model is $\textbf{100}$ times faster than the teacher flow model during sampling. The code, model, and data are available at our project page: https://moflow-imle.github.io
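The requirement that at least one of the $K$ predicted trajectory sets be accurate is commonly expressed as a best-of-$K$ (winner-takes-all) error. The NumPy sketch below shows only that accuracy term for a single agent, assuming `(K, T, 2)` candidate trajectories and a `(T, 2)` ground truth; it does not reproduce the MoFlow loss, which the abstract says also encourages diversity and plausibility across all $K$ sets. The function name `best_of_k_loss` and the mean-squared-displacement error are hypothetical choices.

```python
import numpy as np

def best_of_k_loss(pred, target):
    """Best-of-K error for one agent (illustrative accuracy term only).

    pred:   (K, T, 2) array of K candidate future trajectories.
    target: (T, 2) array holding the ground-truth future trajectory.
    Returns the error of the closest candidate and that candidate's index.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Squared displacement per time step, averaged over the horizon: shape (K,).
    per_candidate = ((pred - target[None]) ** 2).sum(axis=-1).mean(axis=-1)
    best = int(per_candidate.argmin())
    return float(per_candidate[best]), best

# Toy usage: K = 3 candidates over T = 2 steps; the second one matches exactly.
target = np.zeros((2, 2))
pred = np.stack([np.ones((2, 2)), np.zeros((2, 2)), 2.0 * np.ones((2, 2))])
loss, best = best_of_k_loss(pred, target)
```

Because only the closest candidate is penalized, the remaining $K-1$ candidates are free to cover other modes, which is why this kind of term is usually paired with an additional objective that keeps them plausible.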