TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models

2025-05-29
Citations: 0
Influential: 0
AI Summary
Image-text models excel at image-level understanding but lack fine-grained spatial localization capability. To address this, we propose TextRegion, a novel framework that enables zero-shot, training-free, text-guided region token generation for the first time. TextRegion freezes off-the-shelf vision-language models (e.g., CLIP, BLIP) and integrates them with SAM2, leveraging gradient-free text-region feature alignment to produce pixel-accurate, open-vocabulary-compatible region tokens. The method is modular and model-agnostic, supporting seamless integration across diverse vision-language backbones. Evaluated on open-world semantic segmentation, referring expression comprehension, and visual grounding, TextRegion matches or surpasses state-of-the-art zero-shot approaches, significantly advancing fine-grained vision-language alignment without requiring finetuning or additional supervision.

Abstract
Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.
Problem

Research questions and friction points this paper is trying to address.

Combines image-text models and SAM2 for detailed visual understanding
Generates text-aligned region tokens for open-vocabulary tasks
Enables training-free application to semantic segmentation and grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines image-text models with SAM2
Generates text-aligned region tokens
Training-free, open-vocabulary framework
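
The idea of turning patch features and segmentation masks into text-aligned region tokens can be illustrated with a small sketch. This is not the authors' implementation: it assumes a region token is the plain average of the patch features inside a mask, and that a region is labeled by cosine similarity against text embeddings. Plain Python lists stand in for the outputs of a frozen image-text model and of SAM2, and the names mask_pool, cosine, and label_regions are hypothetical.

```python
import math


def mask_pool(patch_feats, mask):
    """Average the patch features selected by a binary region mask.

    patch_feats: list of N feature vectors, one per image patch
                 (stand-in for frozen image-text model features).
    mask: list of N values in {0, 1}; 1 marks a patch inside the region
          (stand-in for a SAM2 mask downsampled to the patch grid).
    """
    dim = len(patch_feats[0])
    total = [0.0] * dim
    count = 0
    for feat, inside in zip(patch_feats, mask):
        if inside:
            count += 1
            for i, value in enumerate(feat):
                total[i] += value
    if count == 0:
        raise ValueError("mask selects no patches")
    return [value / count for value in total]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def label_regions(patch_feats, masks, text_embeds):
    """Give each region the label whose text embedding is most similar."""
    labels = []
    for mask in masks:
        token = mask_pool(patch_feats, mask)  # assumed region token
        best = max(text_embeds, key=lambda name: cosine(token, text_embeds[name]))
        labels.append(best)
    return labels


# Toy example: four patches with 2-D features, two regions, two text queries.
patch_feats = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
text_embeds = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}

labels = label_regions(patch_feats, masks, text_embeds)
print(labels)  # ['cat', 'grass']
```

Because nothing here is trained, swapping in a different image-text backbone would only change where patch_feats and text_embeds come from, which is the kind of modularity the abstract describes.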
Authors

Yao Xiao
University of Illinois at Urbana-Champaign
Qiqian Fu
University of Illinois at Urbana-Champaign
Heyi Tao
University of Illinois at Urbana-Champaign
Yuqun Wu
University of Illinois at Urbana-Champaign
Zhen Zhu
University of Illinois at Urbana-Champaign
Computer Vision, Deep Learning
Derek Hoiem
Professor of Computer Science, University of Illinois