🤖 AI Summary
Current human-centric visual perception (HVP) models suffer from poor generalization and difficulty in edge deployment: single-objective pretraining limits cross-task transferability, while excessive model size hinders deployment on resource-constrained devices. To address these challenges, we propose Scale-Aware Image Pretraining (SAIP), a cross-scale consistency-driven self-supervised pretraining framework with three complementary objectives: Cross-scale Matching (CSM), Cross-scale Reconstruction (CSR), and Cross-scale Search (CSS). SAIP unifies contrastive learning, masked image modeling, and cross-scale retrieval in a single learning paradigm, and pairs these objectives with scalable single- and multi-person data construction strategies that enable lightweight models to jointly capture multi-scale generic visual patterns. Evaluated across 12 HVP benchmarks covering 9 diverse tasks, our method achieves state-of-the-art performance, with gains of 3%-13% on single-person discriminative tasks, 1%-11% on dense prediction tasks, and 1%-6% on multi-person understanding tasks, demonstrating substantially enhanced generalization and edge deployability.
📝 Abstract
Human-centric visual perception (HVP) has recently achieved remarkable progress due to advancements in large-scale self-supervised pretraining (SSP). However, existing HVP models face limitations in adapting to real-world applications, which require general visual patterns for downstream tasks while maintaining computationally sustainable costs to ensure compatibility with edge devices. These limitations primarily arise from two issues: 1) the pretraining objectives focus solely on specific visual patterns, limiting the generalizability of the learned patterns to diverse downstream tasks; and 2) HVP models often have excessively large model sizes, making them incompatible with real-world applications. To address these limitations, we introduce Scale-Aware Image Pretraining (SAIP), a novel SSP framework enabling lightweight vision models to acquire general patterns for HVP. Specifically, SAIP incorporates three learning objectives based on the principle of cross-scale consistency: 1) Cross-scale Matching (CSM), which contrastively learns image-level invariant patterns from multi-scale single-person images; 2) Cross-scale Reconstruction (CSR), which learns pixel-level consistent visual structures from multi-scale masked single-person images; and 3) Cross-scale Search (CSS), which learns to capture diverse patterns from multi-scale multi-person images. The three objectives complement one another, enabling lightweight models to learn the multi-scale generalizable patterns essential for HVP downstream tasks. Extensive experiments conducted across 12 HVP datasets demonstrate that SAIP exhibits remarkable generalization capabilities across 9 human-centric vision tasks. Moreover, it achieves significant performance improvements over existing methods, with gains of 3%-13% in single-person discrimination tasks, 1%-11% in dense prediction tasks, and 1%-6% in multi-person visual understanding tasks.
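To make the cross-scale consistency idea concrete, the sketch below shows a symmetric InfoNCE-style contrastive loss between embeddings of the same single-person image rendered at two different scales, the kind of objective CSM describes. The function name, temperature value, and NumPy formulation are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def cross_scale_infonce(z_a, z_b, temperature=0.1):
    """Illustrative cross-scale contrastive loss (not the paper's code).

    z_a[i] and z_b[i] are embeddings of the same person image at two
    different scales (a positive pair); all other pairs in the batch
    act as negatives. Returns the symmetric InfoNCE loss.
    """
    # L2-normalize so dot products are cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature      # (N, N) similarity matrix
    idx = np.arange(len(z_a))               # positives lie on the diagonal

    def ce(lg):
        # cross-entropy of each row against its diagonal target
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average over both matching directions (a->b and b->a)
    return 0.5 * (ce(logits) + ce(logits.T))
```

When the two scale views embed to nearby points, the diagonal dominates each row and the loss is small; misaligned batches give a loss near log N, which is what drives the encoder toward scale-invariant representations.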