ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension

📅 2024-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) struggle with fine-grained vision-language alignment in referring expression comprehension and grounding, suffering in particular from limited localization accuracy and suboptimal inference efficiency. To address this, the paper proposes a token-collective mechanism combined with a hybrid discrete-continuous perception architecture: each entity is notated explicitly by a group of visual tokens that collectively represent higher-level semantics, removing the reliance on geometric priors or proxy encodings. Prompts and answers are expressed in a unified autoregressive format without additional syntax, enabling referring and grounding to be handled jointly. The approach integrates a shared vision-language vocabulary, hybrid spatial perception, and multimodal autoregressive decoding. Experiments demonstrate substantial improvements over state-of-the-art MLLMs on referring comprehension and grounding benchmarks, with higher inference efficiency and support for complex visual reasoning.
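The hybrid discrete-continuous perception the summary describes can be illustrated with a minimal sketch. This is a hypothetical toy, not the authors' implementation: the vocabulary sizes, the nearest-neighbor quantizer, and all names below are assumptions. The idea shown is that one image yields two views, discrete token IDs drawn from a codebook that shares a vocabulary with text (usable as generation targets), and the original continuous features (usable for perception).

```python
# Hedged sketch of hybrid discrete-continuous perception (hypothetical, not
# the paper's code). Each patch feature is quantized to its nearest codebook
# entry, and the resulting IDs are offset into a vocabulary shared with text.
import numpy as np

rng = np.random.default_rng(0)

TEXT_VOCAB = 32000       # assumed text vocabulary size
NUM_VISUAL_CODES = 8192  # assumed VQ codebook size
EMBED_DIM = 16

codebook = rng.normal(size=(NUM_VISUAL_CODES, EMBED_DIM))

def quantize(patch_features):
    """Map each continuous patch feature to its nearest codebook entry."""
    d = ((patch_features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def to_shared_vocab(visual_ids):
    """Offset visual token IDs so they share one vocabulary with text."""
    return visual_ids + TEXT_VOCAB

patches = rng.normal(size=(9, EMBED_DIM))  # a 3x3 grid of patch features
discrete_ids = to_shared_vocab(quantize(patches))
# A language model would consume both `discrete_ids` (discrete view, for
# autoregressive targets) and `patches` (continuous view, after projection).
print(discrete_ids)
```

Because the visual IDs live in the same vocabulary as text tokens, referring and grounding can be trained with one autoregressive objective instead of a separate box-prediction head.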

📝 Abstract
Aligning vision and language concepts at a finer level remains an essential topic of multimodal large language models (MLLMs), particularly for tasks such as referring and grounding. Existing methods, such as proxy encoding and geometry encoding, incorporate additional syntax to encode spatial information, imposing extra burdens when communicating between language and vision modules. In this study, we propose ClawMachine, offering a new methodology that explicitly notates each entity using token collectives, groups of visual tokens that collaboratively represent higher-level semantics. A hybrid perception mechanism is also explored to perceive and understand scenes from both discrete and continuous spaces. Our method unifies the prompt and answer of visual referential tasks without using additional syntax. By leveraging a joint vision-language vocabulary, ClawMachine further integrates referring and grounding in an auto-regressive manner, demonstrating great potential with scaled-up pre-training data. Experiments show that ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency. It also exhibits the potential to integrate multi-source information for complex visual reasoning, which is beyond the capability of many MLLMs. Our code is available at github.com/martian422/ClawMachine.
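The "token collective" notation from the abstract can be made concrete with a small sketch. All helper names and the grid size below are hypothetical illustrations, not the paper's API: a grounding answer is expressed as the subset of the image's own visual tokens covering the referred entity, rather than as box coordinates wrapped in extra syntax.

```python
# Hedged sketch of token-collective grounding (hypothetical names, not the
# authors' code). The answer to a grounding query is a sequence of visual
# tokens; matching them back to the patch grid recovers the entity's location.
GRID = 4  # assume the image is tokenized into a 4x4 patch grid

def collective_target(visual_ids, entity_mask):
    """Build the answer sequence: the visual tokens on the referred entity."""
    return [tok for tok, on in zip(visual_ids, entity_mask) if on]

def locate(collective, visual_ids):
    """Recover (row, col) patch positions by matching emitted tokens back to
    the grid (assumes token IDs are unique within this image)."""
    wanted = set(collective)
    return [divmod(i, GRID) for i, tok in enumerate(visual_ids) if tok in wanted]

# A 4x4 grid with distinct token IDs; the entity occupies the center 2x2 block.
visual_ids = list(range(100, 116))
entity_mask = [i in (5, 6, 9, 10) for i in range(16)]
collective = collective_target(visual_ids, entity_mask)
print(collective)                      # token IDs standing in for the entity
print(locate(collective, visual_ids))  # patch coordinates recovered from them
```

Since both the prompt and the answer are plain sequences over one shared vocabulary, no coordinate syntax or proxy embedding is needed when the language and vision sides communicate.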
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Models
Visual-Linguistic Integration
Object Localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

ClawMachine
Multi-modal Large Model
Visual Tokenization
Tianren Ma
University of Chinese Academy of Sciences
Lingxi Xie
University of Chinese Academy of Sciences
Yunjie Tian
University at Buffalo, UCAS
Computer vision
Multimodal learning
Boyu Yang
China Mobile Research Institute
Qixiang Ye
University of Chinese Academy of Sciences, University of Maryland
Visual Object Detection
Image Processing