🤖 AI Summary
Vision Transformers rely on fixed-size image patches, ignoring the intrinsic spatial structure and semantic hierarchy of images; moreover, existing approaches struggle to adapt token granularity dynamically while preserving compatibility with pretrained models. This paper proposes the first end-to-end differentiable hierarchical visual tokenization method: using an information-theoretic criterion for model selection, it adaptively generates multi-granularity tokens directly from pixels and integrates into standard Transformer architectures without fine-tuning. The method is natively compatible with mainstream pretrained vision models, enabling plug-and-play upgrades as well as cross-modal raster-to-vector output generation. Experiments demonstrate state-of-the-art or competitive performance on both image classification and dense-prediction tasks. To the authors' knowledge, this is the first differentiable visual tokenization framework that simultaneously achieves semantic awareness, spatial fidelity, and architectural compatibility.
📝 Abstract
Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures, allowing pretrained models to be retrofitted. Our method uses hierarchical model selection with information criteria to achieve competitive performance on both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.
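To make the idea of "hierarchical model selection with information criteria" concrete, here is a minimal, hypothetical sketch (not the paper's actual method, which is differentiable and operates inside a Transformer): a region of pixels is kept as one coarse token, or recursively split into quadrants, depending on which choice yields the lower Bayesian Information Criterion under a simple diagonal-Gaussian model of the region's pixel values. All function names and the Gaussian model are illustrative assumptions.

```python
import numpy as np

def gaussian_bic(x):
    """BIC of a diagonal-Gaussian MLE fit to pixels x of shape (n, d).

    Illustrative model choice, not from the paper.
    """
    n, d = x.shape
    var = x.var(axis=0) + 1e-6          # per-channel variance (regularized)
    # log-likelihood of a diagonal Gaussian evaluated at its MLE
    ll = -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)
    k = 2 * d                            # parameters: mean + variance per channel
    return k * np.log(n) - 2 * ll

def tokenize(region, min_size=4):
    """Recursively split an (H, W, C) region while splitting lowers total BIC."""
    h, w, _ = region.shape
    flat = region.reshape(-1, region.shape[-1]).astype(np.float64)
    if min(h, w) < 2 * min_size:
        return [region]                  # too small to split further
    quads = [region[:h//2, :w//2], region[:h//2, w//2:],
             region[h//2:, :w//2], region[h//2:, w//2:]]
    split_bic = sum(gaussian_bic(q.reshape(-1, q.shape[-1]).astype(np.float64))
                    for q in quads)
    if split_bic < gaussian_bic(flat):   # finer tokens justified by the data
        return [t for q in quads for t in tokenize(q, min_size)]
    return [region]                      # keep one coarse token

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 3))
img[:16, :16] += 5.0                     # a distinct region forces a split
tokens = tokenize(img)
print(len(tokens))                       # structured input yields multiple tokens
```

The key property this toy version shares with the paper's framing is that token granularity is chosen per region by comparing model fit against model complexity, so homogeneous areas stay coarse while structured areas are refined; the paper's contribution is making such a selection end-to-end differentiable.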