AI Summary
Image-text models excel at image-level understanding but lack fine-grained spatial localization capability. To address this, we propose TextRegion, a framework for zero-shot, training-free, text-guided region token generation. TextRegion freezes off-the-shelf vision-language models (e.g., CLIP, BLIP) and integrates them with SAM2, aligning text and region features without gradients to produce pixel-accurate, open-vocabulary region tokens. The method is modular and model-agnostic, supporting seamless integration across diverse vision-language backbones. Evaluated on open-world semantic segmentation, referring expression comprehension, and visual grounding, TextRegion matches or surpasses state-of-the-art zero-shot approaches, significantly advancing fine-grained vision-language alignment without requiring finetuning or additional supervision.
Abstract
Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong vision-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To combine these strengths, we propose TextRegion, a simple, effective, and training-free framework that couples image-text models with SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.
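The core idea of combining an image-text model with a segmentation model can be sketched in a few lines: pool the model's patch features inside each segmentation mask to get one token per region, then score those region tokens against text embeddings. The sketch below is illustrative only, assuming precomputed patch features, masks, and text embeddings; the pooling strategy and all names (`region_tokens`, `match_text`) are hypothetical, not the paper's exact method.

```python
import numpy as np

def region_tokens(patch_feats, masks):
    """Average-pool patch features inside each mask into one L2-normalized token per region.

    patch_feats: (H, W, D) patch embeddings from a frozen image-text model.
    masks: list of (H, W) boolean masks, e.g. produced by SAM2.
    """
    tokens = []
    for m in masks:
        pooled = patch_feats[m].mean(axis=0)            # (D,) mean over masked patches
        tokens.append(pooled / np.linalg.norm(pooled))  # normalize for cosine scoring
    return np.stack(tokens)                             # (R, D)

def match_text(tokens, text_embs):
    """Cosine similarity between region tokens and text embeddings."""
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return tokens @ text_embs.T                          # (R, T)

# Toy example: a 4x4 patch grid with 2-D features, two regions, two text queries.
feats = np.zeros((4, 4, 2))
feats[:2] = [1.0, 0.0]   # top half of the image points one way
feats[2:] = [0.0, 1.0]   # bottom half points the other way
masks = [np.zeros((4, 4), bool), np.zeros((4, 4), bool)]
masks[0][:2] = True      # region 0 = top half
masks[1][2:] = True      # region 1 = bottom half
texts = np.array([[1.0, 0.0], [0.0, 1.0]])

sims = match_text(region_tokens(feats, masks), texts)
print(sims.argmax(axis=1))  # each region matches its aligned text: [0 1]
```

In practice the patch features, masks, and text embeddings would come from the frozen backbones themselves; this toy version only shows how mask pooling turns dense features into text-comparable region tokens.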