🤖 AI Summary
Existing Transformer-based approaches struggle to achieve real-time multi-person 2D pose estimation (MPPE). To address this, we propose the first end-to-end, post-processing-free, real-time Transformer architecture for MPPE. Our method builds on the DETR framework with three key innovations: (1) learnable positive and negative queries that improve keypoint query quality; (2) a lightweight decoder coupled with an adaptive keypoint similarity metric, enabling end-to-end differentiable matching; and (3) an efficient design that substantially reduces parameter count and training cost. Without model quantization, our model achieves real-time inference (≥30 FPS) while matching or surpassing state-of-the-art accuracy, and training converges in 5–10× fewer epochs. This work establishes a new paradigm for real-time, high-accuracy, end-to-end MPPE.
📝 Abstract
Multi-person pose estimation (MPPE) estimates the keypoints of all individuals present in an image. It is a fundamental task for many applications in computer vision and virtual reality. Unfortunately, no existing transformer-based models can perform MPPE in real time. This paper presents a family of transformer-based models capable of performing multi-person 2D pose estimation in real time. Our approach uses a modified decoder architecture and keypoint similarity metrics to generate both positive and negative queries, thereby improving the quality of the queries selected within the architecture. Compared to state-of-the-art models, our proposed models train much faster, using 5 to 10 times fewer epochs, and achieve competitive inference times without relying on quantization libraries to speed up the model. Furthermore, our models match or outperform alternative models, often with significantly fewer parameters.
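The abstract does not specify the exact keypoint similarity metric used for query matching. As a rough, non-authoritative illustration, a COCO-style Object Keypoint Similarity (OKS) between a predicted and a ground-truth pose can be sketched as follows; the function name, the per-keypoint falloff constants `kappa`, and the input layout are all illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def keypoint_similarity(pred, gt, scale, kappa):
    """OKS-style similarity between two poses.

    pred, gt: (K, 2) arrays of (x, y) keypoint coordinates.
    scale:    object scale, e.g. sqrt of the person's bounding-box area.
    kappa:    (K,) per-keypoint falloff constants (hypothetical values).
    Returns a score in (0, 1]; 1.0 means a perfect match.
    """
    d2 = np.sum((pred - gt) ** 2, axis=-1)  # squared distance per keypoint
    # Gaussian falloff normalized by object scale and per-keypoint tolerance
    return float(np.exp(-d2 / (2.0 * scale**2 * kappa**2)).mean())
```

A differentiable score of this form could, in principle, serve both to rank queries against ground-truth poses and to define positive/negative targets, since it degrades smoothly as predicted keypoints drift from the ground truth.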