🤖 AI Summary
Pedestrian attribute recognition (PAR) suffers from weak generalization due to scarce real labeled data and challenges posed by occlusion, pose variation, and complex environments.
Method: This paper pioneers a systematic investigation into image-to-image diffusion models for PAR-specific synthetic data generation. We propose a PAR-oriented prompt engineering framework coupled with joint image-attribute optimization, enabling precise control over textual prompts and attribute conditions to synthesize high-fidelity, diverse, and attribute-accurate pedestrian images.
Contribution/Results: The generated synthetic dataset significantly enhances the robustness of zero-shot PAR models, yielding a 4.5% average accuracy improvement across mainstream benchmarks—particularly improving performance under occlusion and cross-pose scenarios. Our approach establishes a reproducible, diffusion-based data augmentation paradigm for low-resource, fine-grained visual recognition tasks.
📝 Abstract
Pedestrian Attribute Recognition (PAR) involves identifying various human attributes from images with applications in intelligent monitoring systems. The scarcity of large-scale annotated datasets hinders the generalization of PAR models, specially in complex scenarios involving occlusions, varying poses, and diverse environments. Recent advances in diffusion models have shown promise for generating diverse and realistic synthetic images, allowing to expand the size and variability of training data. However, the potential of diffusion-based data expansion for generating PAR-like images remains underexplored. Such expansion may enhance the robustness and adaptability of PAR models in real-world scenarios. This paper investigates the effectiveness of diffusion models in generating synthetic pedestrian images tailored to PAR tasks. We identify key parameters of img2img diffusion-based data expansion; including text prompts, image properties, and the latest enhancements in diffusion-based data augmentation, and examine their impact on the quality of generated images for PAR. Furthermore, we employ the best-performing expansion approach to generate synthetic images for training PAR models, by enriching the zero-shot datasets. Experimental results show that prompt alignment and image properties are critical factors in image generation, with optimal selection leading to a 4.5% improvement in PAR recognition performance.