AI Summary
This work addresses in-context object localization that needs no parameter updates at inference time, is category-agnostic, and is grounded in visual evidence. Existing approaches rely on explicit category supervision, which introduces semantic bias. To overcome this limitation, the authors propose a two-stage training framework that uses no category supervision: first, the model is trained to optimize in-context attention between support regions and query images, which directs it toward visual correspondences rather than semantic priors; second, reinforcement learning with Group Relative Policy Optimization (GRPO) directly minimizes localization error. This approach achieves robust instance-level localization without requiring category labels, a first in the field, and significantly outperforms existing methods across multiple benchmarks. Notably, a 7B-parameter model trained this way surpasses baselines of up to 72B parameters, demonstrating the effectiveness of the proposed training objectives.
Abstract
In-context localization (ICL) seeks to localize a target object specified by a small set of support examples in a query image, operating on the fly without training or parameter updates. Despite rapid advances in vision-language models (VLMs), achieving category-agnostic and visually grounded ICL remains an open problem, even though it is essential for applications such as image editing, personalized visual search, and retrieval. Existing methods are fragile and rely on explicit category supervision, which not only limits applicability in realistic settings with unnamed or instance-specific objects but also introduces category bias that steers predictions toward semantic priors rather than visual evidence. We introduce a two-stage training framework that explicitly optimizes in-context attention between support bounding boxes and query images without category supervision. We further refine localization via reinforcement learning using Group Relative Policy Optimization (GRPO) to directly minimize localization error. This formulation enforces visual correspondence over semantic priors, yielding robust instance-level localization. Empirically, a 7B-parameter model trained with our objectives outperforms models up to 72B parameters, demonstrating that context-aware localization objectives can surpass scaling alone. Comprehensive ablations validate the contribution of each component.
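To make the GRPO refinement step more concrete, the following is a minimal sketch of how a group of sampled bounding boxes could be scored and turned into group-relative advantages. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the reward, so intersection over union (IoU) with the ground-truth box is assumed here, and the function names `iou` and `group_relative_advantages` are hypothetical. The normalization (reward minus group mean, divided by group standard deviation) follows the commonly described GRPO formulation; using the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: one ground-truth box and three sampled predictions.
target = (10, 10, 50, 50)
samples = [(10, 10, 50, 50), (20, 20, 60, 60), (100, 100, 120, 120)]
rewards = [iou(box, target) for box in samples]
advantages = group_relative_advantages(rewards)
```

In this sketch, predictions that overlap the target more than the group average receive a positive advantage and the rest receive a negative one, so the policy update is driven by relative localization quality within the group rather than by an absolute learned value estimate.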