ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension

📅 2024-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) struggle with fine-grained vision-language alignment in referring expression comprehension and grounding, suffering in particular from limited localization accuracy and suboptimal inference efficiency. To address this, the paper proposes a token-collective mechanism combined with a hybrid discrete-continuous perception architecture: each entity is notated explicitly by a group of visual tokens that collectively represent higher-level semantics, removing the reliance on geometric priors or proxy encodings. Prompts and answers are expressed in a unified autoregressive format without additional syntax, enabling referring and grounding to be handled jointly. The approach integrates a shared vision-language vocabulary, hybrid spatial perception, and multimodal autoregressive decoding. Experiments demonstrate substantial improvements over state-of-the-art MLLMs on referring comprehension and grounding benchmarks, with higher inference efficiency and support for complex visual reasoning.
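The hybrid discrete-continuous perception the summary describes can be illustrated with a minimal sketch. This is a hypothetical toy, not the authors' implementation: the vocabulary sizes, the nearest-neighbor quantizer, and all names below are assumptions. The idea shown is that one image yields two views, discrete token IDs drawn from a codebook that shares a vocabulary with text (usable as generation targets), and the original continuous features (usable for perception).

```python
# Hedged sketch of hybrid discrete-continuous perception (hypothetical, not
# the paper's code). Each patch feature is quantized to its nearest codebook
# entry, and the resulting IDs are offset into a vocabulary shared with text.
import numpy as np

rng = np.random.default_rng(0)

TEXT_VOCAB = 32000       # assumed text vocabulary size
NUM_VISUAL_CODES = 8192  # assumed VQ codebook size
EMBED_DIM = 16

codebook = rng.normal(size=(NUM_VISUAL_CODES, EMBED_DIM))

def quantize(patch_features):
    """Map each continuous patch feature to its nearest codebook entry."""
    d = ((patch_features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def to_shared_vocab(visual_ids):
    """Offset visual token IDs so they share one vocabulary with text."""
    return visual_ids + TEXT_VOCAB

patches = rng.normal(size=(9, EMBED_DIM))  # a 3x3 grid of patch features
discrete_ids = to_shared_vocab(quantize(patches))
# A language model would consume both `discrete_ids` (discrete view, for
# autoregressive targets) and `patches` (continuous view, after projection).
print(discrete_ids)
```

Because the visual IDs live in the same vocabulary as text tokens, referring and grounding can be trained with one autoregressive objective instead of a separate box-prediction head.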

📝 Abstract
Aligning vision and language concepts at a finer level remains an essential topic of multimodal large language models (MLLMs), particularly for tasks such as referring and grounding. Existing methods, such as proxy encoding and geometry encoding, incorporate additional syntax to encode spatial information, imposing extra burdens when communicating between language and vision modules. In this study, we propose ClawMachine, offering a new methodology that explicitly notates each entity using token collectives, groups of visual tokens that collaboratively represent higher-level semantics. A hybrid perception mechanism is also explored to perceive and understand scenes from both discrete and continuous spaces. Our method unifies the prompt and answer of visual referential tasks without using additional syntax. By leveraging a joint vision-language vocabulary, ClawMachine further integrates referring and grounding in an auto-regressive manner, demonstrating great potential with scaled-up pre-training data. Experiments show that ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency. It also exhibits the potential to integrate multi-source information for complex visual reasoning, which is beyond the capability of many MLLMs. Our code is available at github.com/martian422/ClawMachine.
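The "token collective" notation from the abstract can be made concrete with a small sketch. All helper names and the grid size below are hypothetical illustrations, not the paper's API: a grounding answer is expressed as the subset of the image's own visual tokens covering the referred entity, rather than as box coordinates wrapped in extra syntax.

```python
# Hedged sketch of token-collective grounding (hypothetical names, not the
# authors' code). The answer to a grounding query is a sequence of visual
# tokens; matching them back to the patch grid recovers the entity's location.
GRID = 4  # assume the image is tokenized into a 4x4 patch grid

def collective_target(visual_ids, entity_mask):
    """Build the answer sequence: the visual tokens on the referred entity."""
    return [tok for tok, on in zip(visual_ids, entity_mask) if on]

def locate(collective, visual_ids):
    """Recover (row, col) patch positions by matching emitted tokens back to
    the grid (assumes token IDs are unique within this image)."""
    wanted = set(collective)
    return [divmod(i, GRID) for i, tok in enumerate(visual_ids) if tok in wanted]

# A 4x4 grid with distinct token IDs; the entity occupies the center 2x2 block.
visual_ids = list(range(100, 116))
entity_mask = [i in (5, 6, 9, 10) for i in range(16)]
collective = collective_target(visual_ids, entity_mask)
print(collective)                      # token IDs standing in for the entity
print(locate(collective, visual_ids))  # patch coordinates recovered from them
```

Since both the prompt and the answer are plain sequences over one shared vocabulary, no coordinate syntax or proxy embedding is needed when the language and vision sides communicate.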
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Models
Visual-Linguistic Integration
Object Localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

ClawMachine
Multi-modal Large Model
Visual Tokenization
Tianren Ma
University of Chinese Academy of Sciences
Lingxi Xie
University of Chinese Academy of Sciences
Yunjie Tian
University at Buffalo, UCAS
Computer vision
Multimodal learning
Boyu Yang
China Mobile Research Institute
Qixiang Ye
University of Chinese Academy of Sciences, University of Maryland
Visual Object Detection
Image Processing