Talking Points: Describing and Localizing Pixels

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models lack the capability for pixel-level keypoint localization, supporting only coarse-grained object- or region-level understanding. To address this limitation, we propose the first pixel-level vision-language localization framework, which jointly generates free-form, coarse-to-fine contextual descriptions of keypoints via a Point Descriptor and regresses precise pixel coordinates via a Point Localizer—enabling bidirectional natural-language referencing and interpretation of arbitrary image pixels. We train the model on the synthetic dataset LlamaPointInPart and fine-tune the Point Descriptor with Group Relative Policy Optimization (GRPO), using the frozen Point Localizer as a reward model. Our approach significantly outperforms existing baselines. Furthermore, we design a dedicated localizability evaluation protocol that quantitatively verifies strong consistency between generated descriptions and predicted coordinates. All code and data are publicly released.

📝 Abstract
Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. We introduce a novel framework for pixel-level grounding. The framework consists of two complementary components: a Point Descriptor that generates rich, contextual descriptions of individual keypoints, and a Point Localizer that regresses precise pixel coordinates from these descriptions. Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context. Since there is no available dataset to train such a system, we introduce LlamaPointInPart, a carefully curated dataset of 20K+ image-keypoint-description triplets synthesized from multiple vision-language models, capturing multi-scale information from scene-level context to visual features around the keypoint. For cross-category generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the frozen Point Localizer as a reward model to produce descriptions that maximize localization accuracy. To evaluate our results, we establish a new evaluation protocol. Instead of comparing the text description produced by our method to the ground truth, we use the localizer to determine how close the predicted point is to the ground-truth point. Experiments demonstrate superior performance compared to baseline models on LlamaPointInPart. The bidirectional nature of our framework should enable future applications in both keypoint-guided image understanding and language-guided precise localization. Our code and dataset are publicly available at https://github.com/matanr/Talking_Points.
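The evaluation protocol described in the abstract is a round trip: describe a keypoint, re-localize it from the description alone, and score the pixel distance back to the original point. A minimal sketch of that loop is below; `describe` and `localize` stand in for the Point Descriptor and Point Localizer and are hypothetical callables, not the released API.

```python
import math

def localizability_error(describe, localize, image, keypoint_xy):
    """Round-trip localizability check: generate a free-form description
    of a keypoint, regress a pixel coordinate from that description, and
    return the Euclidean distance (in pixels) to the original keypoint."""
    description = describe(image, keypoint_xy)   # point -> free-form text
    px, py = localize(image, description)        # text  -> predicted pixel
    gx, gy = keypoint_xy
    return math.hypot(px - gx, py - gy)
```

A lower error means the description carries enough visual context for the localizer to recover the exact pixel, which is the consistency the protocol quantifies.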
Problem

Research questions and friction points this paper is trying to address.

Achieving pixel-precise keypoint grounding through natural language
Generating free-form descriptions to localize keypoints in visual context
Enabling cross-category generalization for keypoint localization tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pixel-level grounding with Point Descriptor and Localizer
Free-form coarse-to-fine descriptions for keypoint localization
Cross-category generalization via GRPO optimization on AP-10K
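The GRPO step in the last bullet can be sketched as follows. GRPO (Group Relative Policy Optimization) standardizes rewards within a group of samples drawn for the same prompt, avoiding a separate value network; here the reward for a sampled description is the negative pixel error of the frozen Point Localizer. Both functions are illustrative sketches, not the released training code, and `localize` is a hypothetical callable.

```python
def grpo_advantages(rewards):
    """GRPO advantage estimate: rewards standardized within one group of
    samples for the same prompt (mean-zero, unit-variance)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

def localization_reward(localize, image, description, keypoint_xy):
    """Reward a generated description by how precisely a frozen Point
    Localizer recovers the keypoint from it: negative Euclidean pixel error,
    so more localizable descriptions score higher."""
    px, py = localize(image, description)
    gx, gy = keypoint_xy
    return -(((px - gx) ** 2 + (py - gy) ** 2) ** 0.5)
```

In this setup the Point Descriptor is the policy being updated, while the localizer's weights stay fixed, so the descriptor is pushed toward descriptions that maximize localization accuracy.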
🔎 Similar Papers
No similar papers found.
Matan Rusanovsky
PhD Student
Computer Vision
Shimon Malnick
Tel Aviv University
Shai Avidan
Tel Aviv University