🤖 AI Summary
Existing human recognition systems typically use separate, specialized models for face and body, which limits generalization under real-world variation in pose, scale, and visibility. This work introduces SapiensID, a unified model for human identification, together with a new Cross Pose-Scale ReID task. SapiensID combines three components built on a Transformer architecture: Retina Patch (RP), a dynamic patch generation scheme that adapts to subject scale and tokenizes regions of interest consistently; a masked recognition model (MRM) that learns from variable-length token sequences; and a Semantic Attention Head (SAH) that pools features around key body parts to learn pose-invariant representations. The model is trained on the large-scale WebBody4M dataset. SapiensID achieves state-of-the-art results on body ReID benchmarks, outperforms specialized models in both short-term and long-term scenarios while remaining competitive with dedicated face recognition systems, and establishes a strong baseline for Cross Pose-Scale ReID.
📝 Abstract
Existing human recognition systems often rely on separate, specialized models for face and body analysis, limiting their effectiveness in real-world scenarios where pose, visibility, and context vary widely. This paper introduces SapiensID, a unified model that bridges this gap, achieving robust performance across diverse settings. SapiensID introduces (i) Retina Patch (RP), a dynamic patch generation scheme that adapts to subject scale and ensures consistent tokenization of regions of interest, (ii) a masked recognition model (MRM) that learns from variable-length token sequences, and (iii) Semantic Attention Head (SAH), a module that learns pose-invariant representations by pooling features around key body parts. To facilitate training, we introduce WebBody4M, a large-scale dataset capturing diverse poses and scale variations. Extensive experiments demonstrate that SapiensID achieves state-of-the-art results on various body ReID benchmarks, outperforming specialized models in both short-term and long-term scenarios while remaining competitive with dedicated face recognition systems. Furthermore, SapiensID establishes a strong baseline for the newly introduced challenge of Cross Pose-Scale ReID, demonstrating its ability to generalize to complex, real-world conditions.
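To make the idea of "pooling features around key body parts" concrete, the following is a minimal sketch, not the SapiensID implementation. It assumes patch tokens with known center coordinates and a set of body keypoints, and it swaps the learned attention of SAH for a fixed Gaussian weighting by distance. The function name `keypoint_pool`, the `sigma` parameter, and the toy 4x4 patch grid are all hypothetical choices made for illustration.

```python
import numpy as np


def keypoint_pool(tokens, token_xy, keypoints_xy, sigma=0.15):
    """Pool patch tokens around each keypoint (illustrative stand-in for SAH).

    tokens:       (N, D) patch features
    token_xy:     (N, 2) patch centers in normalized [0, 1] image coordinates
    keypoints_xy: (K, 2) body-part locations in the same coordinates
    returns:      (K, D) one pooled feature per keypoint
    """
    # Squared distance between every keypoint and every patch center: (K, N)
    diff = keypoints_xy[:, None, :] - token_xy[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)

    # Assumption: fixed Gaussian weights instead of learned attention scores.
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1

    # Weighted average of patch features for each keypoint
    return weights @ tokens


# Toy input: a 4x4 grid of patches with 8-dimensional features
rng = np.random.default_rng(0)
grid = (np.arange(4) + 0.5) / 4
token_xy = np.stack(np.meshgrid(grid, grid, indexing="xy"), axis=-1).reshape(-1, 2)
tokens = rng.normal(size=(16, 8))

# Hypothetical keypoints: head, torso, feet along the vertical center line
keypoints_xy = np.array([[0.5, 0.1], [0.5, 0.5], [0.5, 0.9]])

pooled = keypoint_pool(tokens, token_xy, keypoints_xy)
print(pooled.shape)  # (3, 8)
```

Because the output has one row per keypoint regardless of how many patches the image produced, a pooling step of this kind yields a fixed-size descriptor even when the number of input tokens varies, which is one plausible reason such a head pairs naturally with variable-length tokenization.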