Differentiable Hierarchical Visual Tokenization

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Transformers rely on fixed-size image patches, which ignore the spatial structure and semantic hierarchy of images, and existing approaches struggle to adapt token granularity dynamically while staying compatible with pretrained models. This paper proposes an end-to-end differentiable hierarchical visual tokenizer: using information criteria for hierarchical model selection, it adapts tokens to image content at pixel-level granularity and remains backward-compatible with standard Transformer architectures, so pretrained vision models can be retrofitted. The method also supports out-of-the-box raster-to-vector conversion. Experiments show competitive performance on both image classification and dense prediction tasks.

📝 Abstract
Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.

Problem

Research questions and friction points this paper is trying to address.

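Fixed patch tokens ignore the spatial and semantic structure of images; the paper replaces them with adaptive, content-aware tokenization
Tokenization should reach pixel-level granularity while remaining compatible with existing architectures
A single tokenizer should support classification, dense prediction, and raster-to-vector conversion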

Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable tokenizer adapts to image content granularity
Hierarchical model selection with information criteria
Backward-compatible with existing pretrained architectures
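
To make "hierarchical model selection with information criteria" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's algorithm: it uses a fixed grid hierarchy instead of content-adaptive regions, a generic BIC with an assumed known noise variance, and it is not differentiable. The function names and the `noise_var` parameter are hypothetical.

```python
import numpy as np

def bic_for_block_size(image: np.ndarray, block: int, noise_var: float) -> float:
    """BIC of a piecewise-constant model with one mean per block x block region.

    Assumes Gaussian noise with known variance `noise_var` (an assumption of
    this sketch), so that -2 * log-likelihood equals RSS / noise_var up to a
    constant shared by all candidate models.
    """
    h, w = image.shape
    if h % block or w % block:
        raise ValueError("image sides must be divisible by the block size")
    n = h * w
    # Group pixels into non-overlapping blocks and compute each block's mean.
    blocks = image.reshape(h // block, block, w // block, block)
    means = blocks.mean(axis=(1, 3), keepdims=True)
    rss = float(((blocks - means) ** 2).sum())
    k = (h // block) * (w // block)  # one free parameter (the mean) per region
    return rss / noise_var + k * np.log(n)

def select_block_size(image, candidates=(1, 2, 4, 8, 16), noise_var=0.05 ** 2):
    """Return the candidate granularity with the lowest BIC, plus all scores."""
    scores = {b: bic_for_block_size(image, b, noise_var) for b in candidates}
    return min(scores, key=scores.get), scores

# Synthetic image: a 4 x 4 layout of constant 8 x 8 regions plus mild noise.
rng = np.random.default_rng(0)
clean = np.kron(np.arange(16).reshape(4, 4) / 16.0, np.ones((8, 8)))
noisy = clean + rng.normal(scale=0.05, size=clean.shape)
best, scores = select_block_size(noisy)
print("selected block size:", best)
```

The trade-off is what matters here: finer partitions fit the pixels better but pay a larger complexity penalty, so the criterion settles on the coarsest level that still explains the image. A content-adaptive tokenizer would apply the same kind of trade-off per region, not once per image.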

Authors
Marius Aasan
University of Oslo
Machine Learning, Imaging, Probabilistic Machine Learning

Martine Hjelkrem-Tan
Research fellow, University of Oslo
Machine Learning

Nico Catalano
Polytechnic University of Milan, Artificial Intelligence and Robotics Lab

Changkyu Choi
UiT The Arctic University of Norway, Department of Physics and Technology

Adín Ramírez Rivera
Professor, University of Oslo
Image Processing, Computer Vision, Machine Learning