ALADIN: Attribute-Language Distillation Network for Person Re-Identification

📅 2026-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing CLIP-guided person re-identification methods, which rely on global features and fixed prompts, thereby struggling to capture fine-grained attributes and handle complex appearance variations. To overcome these challenges, we propose an attribute-language distillation framework that transfers knowledge from a frozen CLIP teacher model to a lightweight student network. Our approach employs a scene-aware soft prompt generator for adaptive image-text alignment and introduces an attribute-local alignment mechanism to enhance fine-grained representation. Leveraging structured attribute descriptions generated by multimodal large language models as supervision signals, the proposed method significantly outperforms CNN-, Transformer-, and CLIP-based baselines on Market-1501, DukeMTMC-reID, and MSMT17 benchmarks, demonstrating superior robustness to occlusion, enhanced generalization, and improved interpretability.
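The summary above mentions a scene-aware soft prompt generator. Below is a minimal, hypothetical sketch of what such a module could look like in PyTorch; the class name, dimensions, and prompt length are illustrative assumptions, not the paper's released implementation.

```python
# A minimal sketch of a scene-aware soft prompt generator, assuming a
# CLIP-style text encoder with 512-dim token embeddings. All names,
# dimensions, and the prompt length n_ctx are illustrative assumptions.
import torch
import torch.nn as nn

class SceneAwarePromptGenerator(nn.Module):
    """Maps a global image feature to image-specific soft prompt tokens."""
    def __init__(self, img_dim: int = 512, token_dim: int = 512, n_ctx: int = 4):
        super().__init__()
        self.n_ctx = n_ctx
        self.token_dim = token_dim
        # Lightweight projection from the image feature to n_ctx prompt tokens.
        self.proj = nn.Sequential(
            nn.Linear(img_dim, token_dim),
            nn.ReLU(inplace=True),
            nn.Linear(token_dim, n_ctx * token_dim),
        )

    def forward(self, img_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, img_dim) global image embedding from the CLIP teacher.
        B = img_feat.size(0)
        ctx = self.proj(img_feat)                       # (B, n_ctx * token_dim)
        return ctx.view(B, self.n_ctx, self.token_dim)  # (B, n_ctx, token_dim)
```

In a CoOp-style conditional-prompting pipeline, these image-conditioned tokens would be prepended to the embedded attribute or class-name tokens before the frozen CLIP text encoder, so each image steers its own text prompt rather than relying on a fixed template.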

📝 Abstract
Recent vision-language models such as CLIP provide strong cross-modal alignment, but current CLIP-guided ReID pipelines rely on global features and fixed prompts. This limits their ability to capture fine-grained attribute cues and adapt to diverse appearances. We propose ALADIN, an attribute-language distillation network that distills knowledge from a frozen CLIP teacher to a lightweight ReID student. ALADIN introduces fine-grained attribute-local alignment to establish adaptive text-visual correspondence and robust representation learning. A Scene-Aware Prompt Generator produces image-specific soft prompts to facilitate adaptive alignment. Attribute-local distillation enforces consistency between textual attributes and local visual features, significantly enhancing robustness under occlusions. Furthermore, we employ cross-modal contrastive and relation distillation to preserve the inherent structural relationships among attributes. To provide precise supervision, we leverage Multimodal LLMs to generate structured attribute descriptions, which are then converted into localized attention maps via CLIP. At inference, only the student is used. Experiments on Market-1501, DukeMTMC-reID, and MSMT17 show improvements over CNN-, Transformer-, and CLIP-based methods, with better generalization and interpretability.
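To make the attribute-local alignment idea concrete, here is a hedged sketch of one plausible form of the distillation objective: teacher attribute text embeddings attend over the student's local visual features, and each attribute's attention-pooled visual feature is pulled toward its text embedding. The function name, tensor shapes, and temperature are assumptions for illustration only, not the paper's actual loss.

```python
# A minimal sketch of attribute-local alignment, assuming the frozen CLIP
# teacher supplies one text embedding per attribute phrase and the student
# produces a grid of local visual features. Shapes are illustrative.
import torch
import torch.nn.functional as F

def attribute_local_loss(attr_text: torch.Tensor,
                         local_feats: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """
    attr_text:   (B, A, D) teacher text embeddings for A attribute phrases.
    local_feats: (B, N, D) student local features over N spatial tokens.
    """
    attr_text = F.normalize(attr_text, dim=-1)
    local_feats = F.normalize(local_feats, dim=-1)
    # Attribute-to-region attention: each attribute attends over local tokens.
    attn = torch.softmax(
        attr_text @ local_feats.transpose(1, 2) / temperature, dim=-1
    )                                                  # (B, A, N)
    # Attention-pooled visual feature for each attribute.
    attr_vis = F.normalize(attn @ local_feats, dim=-1)  # (B, A, D)
    # Pull each attribute's pooled visual feature toward its text embedding.
    return (1.0 - (attr_vis * attr_text).sum(dim=-1)).mean()
```

Because the attention maps localize each attribute to image regions, a loss of this form can stay informative when parts of the person are occluded, which is consistent with the robustness claim in the abstract.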
Problem

Research questions and friction points this paper is trying to address.

person re-identification
fine-grained attributes
vision-language alignment
occlusion robustness
adaptive prompting
Innovation

Methods, ideas, or system contributions that make the work stand out.

attribute-local alignment
scene-aware prompt generation
cross-modal distillation
multimodal LLM supervision
person re-identification
Wang Zhou
Sun Yat-Sen University
Boran Duan
Wuhan University, No.299 Bayi Road, 430072, PR China
Haojun Ai
Wuhan University, No.299 Bayi Road, 430072, PR China
Ruiqi Lan
Wuhan University, No.299 Bayi Road, 430072, PR China
Ziyue Zhou
Wuhan University, No.299 Bayi Road, 430072, PR China