ROGLE: Robust Global-Local Alignment with Automated Region Supervision for Text-Based Person Search

📅 2026-06-01

🤖 AI Summary

This work addresses two limitations of existing CLIP-based text-based person search methods: global representation bias and semantic sparsity. Both hinder fine-grained alignment, and the lack of region-level supervision compounds the problem. The authors propose ROGLE, a framework built around an automated Region-to-Sentence Matching (RSM) mechanism that generates pseudo region-sentence labels without manual annotation, enabling joint optimization of global contrastive learning and local region alignment. They also construct P-VLG, a large-scale benchmark with over 100,000 annotated regions and long-form captions, described as the first Text-Based Person Search benchmark to support both global and local evaluation. The reported experiments show ROGLE outperforming state-of-the-art methods across multiple benchmarks, particularly on long textual queries. The code and the P-VLG dataset are to be publicly released.

📝 Abstract

Text-Based Person Search (TBPS) aims to retrieve pedestrian images using natural language queries. However, existing TBPS models, especially those based on CLIP, struggle with fine-grained understanding due to global representational bias and semantic sparsity inherited from training on short captions. This results in weak fine-grained alignment, exacerbated by the scarcity of region-level annotations. To address this, we propose ROGLE (Robust Global-Local Embedding), a unified framework that overcomes reliance on costly manual annotations through an automated Region-to-Sentence Matching (RSM) strategy. RSM automatically mines pseudo region-sentence pairs for scalable fine-grained supervision. Furthermore, ROGLE employs a multi-granular learning strategy that fuses global contrastive learning with region-level local alignment. We also introduce the P-VLG Benchmark, a large-scale dataset constructed by curating and enriching images from established public benchmarks. It features over 100,000 annotated regions and rich long-form captions, making it the first TBPS benchmark to support both global and local assessment protocols. Extensive experiments show that ROGLE significantly outperforms existing approaches, particularly on challenging long-form queries. Code and the P-VLG benchmark will be made publicly available.
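The abstract does not specify how RSM selects pairs or how the two objectives are weighted, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes region and sentence embeddings are already available as plain lists of floats, mines pseudo pairs by mutual nearest neighbour on cosine similarity with a cutoff, and adds a local contrastive term to a global one. The function names (`mine_pairs`, `info_nce`, `joint_loss`), the mutual-nearest-neighbour rule, and the `threshold`, `temperature`, and `local_weight` parameters are all hypothetical choices made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def cosine_matrix(rows, cols):
    """Cosine similarity between every row vector and every column vector."""
    rows = [normalize(r) for r in rows]
    cols = [normalize(c) for c in cols]
    return [[sum(a * b for a, b in zip(r, c)) for c in cols] for r in rows]


def mine_pairs(region_emb, sentence_emb, threshold=0.5):
    """Assumed pairing rule: keep (region, sentence) index pairs that are
    mutual nearest neighbours and whose similarity reaches the threshold."""
    sim = cosine_matrix(region_emb, sentence_emb)
    pairs = []
    for i, row in enumerate(sim):
        j = max(range(len(row)), key=row.__getitem__)           # best sentence for region i
        best_region = max(range(len(sim)), key=lambda k: sim[k][j])  # best region for sentence j
        if best_region == i and row[j] >= threshold:
            pairs.append((i, j))
    return pairs


def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss; a[i] and b[i] are treated as positives."""
    sim = cosine_matrix(a, b)

    def one_direction(m):
        total = 0.0
        for i, row in enumerate(m):
            logits = [x / temperature for x in row]
            peak = max(logits)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
            total += log_z - logits[i]
        return total / len(m)

    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(transposed))


def joint_loss(img_global, txt_global, region_emb, sentence_emb, pairs,
               local_weight=1.0):
    """Global image-text loss plus a weighted local loss over mined pairs."""
    loss = info_nce(img_global, txt_global)
    if pairs:
        regions = [region_emb[i] for i, _ in pairs]
        sentences = [sentence_emb[j] for _, j in pairs]
        loss += local_weight * info_nce(regions, sentences)
    return loss


# Toy data: three regions and three sentences whose order is shuffled.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sentences = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pairs = mine_pairs(regions, sentences)
total = joint_loss(regions, regions, regions, sentences, pairs)
```

In practice the embeddings would come from the image and text encoders, and the mutual-nearest-neighbour filter is just one plausible way to keep pseudo labels conservative; a stricter `threshold` trades coverage for cleaner pairs.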

Problem

Research questions and friction points this paper is trying to address.

Text-Based Person Search

fine-grained alignment

region-level annotations

semantic sparsity

global representational bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

Region-to-Sentence Matching

Global-Local Alignment

Text-Based Person Search