AI Summary
Image-text models excel at image-level understanding but lack fine-grained spatial localization capability. To address this, we propose TextRegion, a framework for zero-shot, training-free, text-guided region token generation. TextRegion freezes off-the-shelf vision-language models (e.g., CLIP, BLIP) and integrates them with SAM2, aligning text and region features without gradients to produce pixel-accurate, open-vocabulary region tokens. The method is modular and model-agnostic, supporting seamless integration across diverse vision-language backbones. Evaluated on open-world semantic segmentation, referring expression comprehension, and visual grounding, TextRegion matches or surpasses state-of-the-art zero-shot approaches, significantly advancing fine-grained vision-language alignment without requiring finetuning or additional supervision.
Abstract
Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong vision-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To combine these strengths, we propose TextRegion, a simple, effective, and training-free framework that couples image-text models with SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.
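The core idea of combining an image-text model with a segmentation model can be sketched in a few lines: pool the model's patch features inside each segmentation mask to get one token per region, then score those region tokens against text embeddings. The sketch below is illustrative only, assuming precomputed patch features, masks, and text embeddings; the pooling strategy and all names (`region_tokens`, `match_text`) are hypothetical, not the paper's exact method.

```python
import numpy as np

def region_tokens(patch_feats, masks):
    """Average-pool patch features inside each mask into one L2-normalized token per region.

    patch_feats: (H, W, D) patch embeddings from a frozen image-text model.
    masks: list of (H, W) boolean masks, e.g. produced by SAM2.
    """
    tokens = []
    for m in masks:
        pooled = patch_feats[m].mean(axis=0)            # (D,) mean over masked patches
        tokens.append(pooled / np.linalg.norm(pooled))  # normalize for cosine scoring
    return np.stack(tokens)                             # (R, D)

def match_text(tokens, text_embs):
    """Cosine similarity between region tokens and text embeddings."""
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return tokens @ text_embs.T                          # (R, T)

# Toy example: a 4x4 patch grid with 2-D features, two regions, two text queries.
feats = np.zeros((4, 4, 2))
feats[:2] = [1.0, 0.0]   # top half of the image points one way
feats[2:] = [0.0, 1.0]   # bottom half points the other way
masks = [np.zeros((4, 4), bool), np.zeros((4, 4), bool)]
masks[0][:2] = True      # region 0 = top half
masks[1][2:] = True      # region 1 = bottom half
texts = np.array([[1.0, 0.0], [0.0, 1.0]])

sims = match_text(region_tokens(feats, masks), texts)
print(sims.argmax(axis=1))  # each region matches its aligned text: [0 1]
```

In practice the patch features, masks, and text embeddings would come from the frozen backbones themselves; this toy version only shows how mask pooling turns dense features into text-comparable region tokens.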