🤖 AI Summary
Pedestrian attribute recognition (PAR) faces three key challenges: severe class imbalance in fine-grained attribute prediction, strong inter-attribute correlations, and poor cross-domain generalization. To address these, we propose a modular vision-language framework built upon frozen multilingual SigLIP 2 encoders. First, both the visual and textual encoders are frozen to preserve rich multilingual semantic priors. Second, a lightweight cross-modal cross-attention mechanism precisely aligns image features with learnable prompt embeddings. Third, a prompt-driven visual feature refinement module enhances discriminability and mitigates distribution shift across domains. On PA100K, PETA, and Market-1501, the method achieves state-of-the-art performance on PA100K and substantial mean-accuracy gains on PETA and Market-1501, effectively alleviating class imbalance and domain discrepancy and demonstrating strong robustness and generalization.
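The cross-modal fusion described above can be sketched as a single cross-attention step in which learnable per-attribute prompt embeddings act as queries over the frozen encoder's image patch features. This is a minimal numpy illustration, not the paper's implementation: the projection matrices, dimensions, and attribute count (26, as in PA100K) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(prompts, patches, d_k=64, seed=0):
    """Learnable prompt embeddings (queries) attend over frozen image
    patch features (keys/values). The random projection matrices stand
    in for the trainable parameters of the fusion module."""
    rng = np.random.default_rng(seed)
    d_p, d_v = prompts.shape[-1], patches.shape[-1]
    W_q = rng.standard_normal((d_p, d_k)) / np.sqrt(d_p)
    W_k = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    W_v = rng.standard_normal((d_v, d_p)) / np.sqrt(d_v)
    Q = prompts @ W_q                       # (num_prompts, d_k)
    K = patches @ W_k                       # (num_patches, d_k)
    V = patches @ W_v                       # (num_patches, d_p)
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (num_prompts, num_patches)
    return prompts + attn @ V               # residual update of the prompts

# Hypothetical setup: one prompt per attribute, a 14x14 patch grid
# from the frozen visual encoder.
prompts = np.zeros((26, 128))
patches = np.random.default_rng(1).standard_normal((196, 768))
fused = cross_attention_fusion(prompts, patches)
print(fused.shape)  # (26, 128)
```

Keeping the encoders frozen means only the small projection and prompt parameters would be trained, which is what makes the fusion "lightweight."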
📝 Abstract
Pedestrian Attribute Recognition (PAR) involves predicting fine-grained attributes such as clothing color, gender, and accessories from pedestrian imagery, yet is hindered by severe class imbalance, intricate attribute co-dependencies, and domain shift. We introduce VLM-PAR, a modular vision-language framework built on frozen SigLIP 2 multilingual encoders. By aligning image and prompt embeddings and refining visual features through a compact cross-attention fusion module, VLM-PAR achieves a significant accuracy improvement on the highly imbalanced PA100K benchmark, setting a new state of the art, while also delivering substantial gains in mean accuracy on the PETA and Market-1501 benchmarks. These results underscore the efficacy of combining large-scale vision-language pretraining with targeted cross-modal refinement to overcome imbalance and generalization challenges in PAR.
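Since PAR is multi-label, a natural readout consistent with SigLIP's sigmoid-based formulation is to score each attribute independently from the similarity between a refined visual-prompt feature and the frozen text embedding of that attribute. The sketch below is an assumption about the prediction head, not the paper's stated design; the `scale` and `bias` values stand in for learned calibration terms.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attribute_scores(refined, text_emb, scale=10.0, bias=-5.0):
    """Per-attribute probabilities from scaled cosine similarity
    between refined visual-prompt features and frozen text-encoder
    embeddings, squashed by a sigmoid (one score per attribute,
    independent of the others -- suitable for multi-label PAR)."""
    r = refined / np.linalg.norm(refined, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = scale * np.sum(r * t, axis=-1) + bias
    return sigmoid(logits)

rng = np.random.default_rng(0)
refined = rng.standard_normal((26, 128))   # one refined feature per attribute
text_emb = rng.standard_normal((26, 128))  # frozen text embeddings
p = attribute_scores(refined, text_emb)    # shape (26,), each in (0, 1)
```

Using independent sigmoids rather than a softmax over attributes matches the multi-label nature of PAR, where a pedestrian can exhibit many attributes at once.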