🤖 AI Summary
VISUALCENT addresses the limited generalizability of pose estimation and instance segmentation, as well as insufficient robustness to occlusion and motion, in multi-person visual analysis. It proposes a unified bottom-up framework centered on a dynamic centroid representation mechanism: introducing KeyCentroid (keypoint centroid) and MaskCentroid (mask centroid), jointly leveraging disk-shaped heatmap modeling and explicit centroid-driven pixel clustering to enable co-optimization of keypoint detection and instance segmentation. This paradigm significantly enhances resilience to severe occlusion and rapid, large-scale motion. Evaluated on COCO and OCHuman benchmarks, VISUALCENT achieves state-of-the-art performance in both mAP and FPS, enabling real-time, high-accuracy multi-person analysis. The implementation is publicly available.
📝 Abstract
We introduce VISUALCENT, a unified human pose and instance segmentation framework that addresses generalizability and scalability limitations in multi-person visual human analysis. VISUALCENT leverages a centroid-based bottom-up keypoint detection paradigm and uses a Keypoint Heatmap incorporating a Disk Representation and KeyCentroid to identify optimal keypoint coordinates. For the unified segmentation task, an explicit keypoint is defined as a dynamic centroid, called MaskCentroid, that swiftly clusters pixels to a specific human instance during rapid changes in body movement or in heavily occluded environments. Experimental results on the COCO and OCHuman datasets demonstrate VISUALCENT's accuracy and real-time performance advantages, outperforming existing methods in both mAP score and execution frame rate (FPS). The implementation is available on the project page.
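The centroid-driven clustering idea can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the per-pixel offset representation, and the nearest-centroid assignment rule are assumptions; the sketch only shows how pixels can be grouped to instances by comparing each pixel's predicted centroid against detected instance centroids.

```python
import numpy as np

def cluster_pixels_to_instances(offsets, centroids, fg_mask):
    """Illustrative (hypothetical) centroid-driven pixel clustering.

    offsets:   (H, W, 2) per-pixel offset vectors pointing toward the
               instance centroid (a stand-in for MaskCentroid regression).
    centroids: (N, 2) detected instance centroids in (row, col) order.
    fg_mask:   (H, W) boolean foreground (person) mask.
    Returns an (H, W) int map: instance index per pixel, -1 for background.
    """
    H, W = fg_mask.shape
    ys, xs = np.nonzero(fg_mask)
    coords = np.stack([ys, xs], axis=1).astype(np.float32)
    # Each foreground pixel votes for a centroid location.
    pred = coords + offsets[ys, xs]
    # Distance from each vote to every instance centroid (broadcasting).
    dists = np.linalg.norm(pred[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.full((H, W), -1, dtype=np.int32)
    labels[ys, xs] = dists.argmin(axis=1)
    return labels
```

With zero offsets this degenerates to nearest-centroid assignment; in a learned setting the offsets would pull occluded or fast-moving body pixels toward the correct instance center.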