Unified Human Localization and Trajectory Prediction with Monocular Vision

📅 2025-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing pedestrian trajectory prediction methods rely on high-quality annotated data and specialized sensors, resulting in poor robustness and high deployment costs in real-world robotic applications. To address this, we propose MonoTransmotion (MT), a Transformer-based framework designed for monocular camera input that jointly learns, in an end-to-end manner, bird's-eye-view (BEV) human localization and future trajectory prediction. MT introduces a novel directional temporal smoothing loss to enhance the continuity of BEV localization estimates over time, and its multi-task joint optimization improves generalization under noisy or suboptimal observations. On standard benchmarks, MT achieves approximately 12% improvement in both BEV localization accuracy and trajectory prediction performance. Crucially, it maintains stable performance in real-world, unannotated scenarios, demonstrating strong robustness and practical deployability for robotic systems.

📝 Abstract
Conventional human trajectory prediction models rely on clean, curated data that requires specialized equipment or manual labeling, which is often impractical for robotic applications. Existing predictors tend to overfit to clean observations, which hurts their robustness when used with noisy inputs. In this work, we propose MonoTransmotion (MT), a Transformer-based framework that uses only a monocular camera to jointly solve the localization and prediction tasks. Our framework has two main modules: Bird's Eye View (BEV) localization and trajectory prediction. The BEV localization module estimates a person's position from 2D human poses, enhanced by a novel directional loss for smoother sequential localization. The trajectory prediction module predicts future motion from these estimates. We show that jointly training both tasks in our unified framework makes the method more robust in real-world scenarios with noisy inputs. We validate our MT network on both curated and non-curated datasets. On the curated dataset, MT achieves around 12% improvement over baseline models on both BEV localization and trajectory prediction. On a real-world non-curated dataset, experimental results indicate that MT maintains similar performance levels, highlighting its robustness and generalization capability. The code is available at https://github.com/vita-epfl/MonoTransmotion.
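To make the two-module structure concrete, here is a minimal sketch of the localize-then-predict pipeline the abstract describes. This is not the paper's model: the function names and the linear placeholders standing in for MT's Transformer encoder and decoder are assumptions for illustration; only the data flow (2D pose sequence → BEV positions → future BEV trajectory) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def bev_localization(poses_2d, W_loc):
    # poses_2d: (T, J*2) flattened 2D joint coordinates per frame.
    # Placeholder linear head standing in for MT's Transformer-based
    # BEV localization module (hypothetical, for shape illustration only).
    return poses_2d @ W_loc                      # (T, 2) BEV positions

def trajectory_prediction(bev_hist, W_pred, horizon):
    # Autoregressive placeholder for the trajectory prediction module:
    # each future step is produced from the previous BEV position.
    preds, last = [], bev_hist[-1]
    for _ in range(horizon):
        last = last @ W_pred
        preds.append(last)
    return np.stack(preds)                       # (horizon, 2)

# Toy shapes: 9 observed frames, 17 joints, 12 predicted future steps.
T, J, horizon = 9, 17, 12
poses = rng.normal(size=(T, J * 2))
W_loc = rng.normal(size=(J * 2, 2)) * 0.1
W_pred = np.eye(2)                               # identity = constant-position baseline
bev = bev_localization(poses, W_loc)
future = trajectory_prediction(bev, W_pred, horizon)
```

In the actual framework both modules are trained jointly end to end, so localization noise is seen by the predictor during training; the sketch above only shows the inference-time data flow.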
Problem

Research questions and friction points this paper is trying to address.

How can human localization and trajectory prediction be unified in a single monocular-vision framework?
Existing predictors overfit to clean, curated observations, hurting robustness on noisy real-world inputs.
Specialized sensors and manual annotation make conventional pipelines costly to deploy on robots.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based framework for monocular vision tasks
Joint BEV localization and trajectory prediction modules
Directional loss enhances sequential localization accuracy
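The directional loss bullet can be sketched as follows. The paper's exact formulation is not given here, so this is a plausible assumption: penalize direction changes between consecutive BEV displacement vectors via one minus their cosine similarity, which is zero for straight motion and grows as the estimated track zig-zags.

```python
import numpy as np

def directional_smoothing_loss(bev_positions, eps=1e-8):
    """Hypothetical sketch of a directional temporal smoothing loss.

    bev_positions: (T, 2) array of sequential BEV (x, y) estimates.
    Returns the mean of (1 - cosine similarity) over consecutive
    displacement-vector pairs; eps guards against zero-length steps.
    """
    disp = np.diff(bev_positions, axis=0)        # (T-1, 2) displacements
    v1, v2 = disp[:-1], disp[1:]                 # consecutive displacement pairs
    cos = np.sum(v1 * v2, axis=1) / (
        np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1) + eps
    )
    return float(np.mean(1.0 - cos))

# A straight track incurs ~zero penalty; a zig-zag track is penalized.
straight = np.array([[0, 0], [1, 0], [2, 0], [3, 0]], dtype=float)
zigzag = np.array([[0, 0], [1, 1], [2, 0], [3, 1]], dtype=float)
```

Adding such a term to the localization objective encourages temporally smooth BEV estimates, which matches the stated goal of smoother sequential localizations.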