🤖 AI Summary
Existing Transformer-based approaches struggle to achieve real-time multi-person 2D pose estimation (MPPE). To address this, we propose the first end-to-end, post-processing-free, real-time Transformer architecture for MPPE. Our method builds on the DETR framework with three key innovations: (1) learnable positive and negative queries that improve keypoint query quality; (2) a lightweight decoder coupled with an adaptive keypoint similarity metric, enabling end-to-end differentiable matching; and (3) an efficient design that substantially reduces parameter count and training cost. Without model quantization, our model achieves real-time inference (≥30 FPS) while matching or surpassing state-of-the-art accuracy, and training converges in 5–10× fewer epochs. This work establishes a new paradigm for real-time, high-accuracy, end-to-end MPPE.
📝 Abstract
Multi-person pose estimation (MPPE) estimates the keypoints of all individuals present in an image. It is a fundamental task for many applications in computer vision and virtual reality. Unfortunately, no existing transformer-based models can perform MPPE in real time. This paper presents a family of transformer-based models capable of performing multi-person 2D pose estimation in real time. Our approach uses a modified decoder architecture and keypoint similarity metrics to generate both positive and negative queries, thereby improving the quality of the queries selected within the architecture. Compared to state-of-the-art models, our proposed models train much faster, using 5 to 10 times fewer epochs, and achieve competitive inference times without relying on quantization libraries to speed up the model. Furthermore, our models match or outperform alternative models, often with significantly fewer parameters.
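The abstract does not specify the exact keypoint similarity metric used for query matching. As a rough, non-authoritative illustration, a COCO-style Object Keypoint Similarity (OKS) between a predicted and a ground-truth pose can be sketched as follows; the function name, the per-keypoint falloff constants `kappa`, and the input layout are all illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def keypoint_similarity(pred, gt, scale, kappa):
    """OKS-style similarity between two poses.

    pred, gt: (K, 2) arrays of (x, y) keypoint coordinates.
    scale:    object scale, e.g. sqrt of the person's bounding-box area.
    kappa:    (K,) per-keypoint falloff constants (hypothetical values).
    Returns a score in (0, 1]; 1.0 means a perfect match.
    """
    d2 = np.sum((pred - gt) ** 2, axis=-1)  # squared distance per keypoint
    # Gaussian falloff normalized by object scale and per-keypoint tolerance
    return float(np.exp(-d2 / (2.0 * scale**2 * kappa**2)).mean())
```

A differentiable score of this form could, in principle, serve both to rank queries against ground-truth poses and to define positive/negative targets, since it degrades smoothly as predicted keypoints drift from the ground truth.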