🤖 AI Summary
Vision Transformers rely on fixed-size image patches, ignoring the intrinsic spatial structure and semantic hierarchy of images; moreover, existing approaches struggle to adapt token granularity dynamically while preserving compatibility with pretrained models. This paper proposes the first end-to-end differentiable hierarchical visual tokenization method: using an information-theoretic criterion for model selection, it adaptively generates multi-granularity tokens directly from pixels and integrates into standard Transformer architectures without fine-tuning. The method is natively compatible with mainstream pretrained vision models, enabling plug-and-play upgrades as well as cross-modal raster-to-vector output generation. Experiments demonstrate state-of-the-art or competitive performance on both image classification and dense-prediction tasks. To the authors' knowledge, this is the first differentiable visual tokenization framework that simultaneously achieves semantic awareness, spatial fidelity, and architectural compatibility.
📝 Abstract
Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures, allowing pretrained models to be retrofitted. Our method uses hierarchical model selection with information criteria to achieve competitive performance on both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.
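To make the idea of "hierarchical model selection with information criteria" concrete, here is a minimal, hypothetical sketch (not the paper's actual method, which is differentiable and operates inside a Transformer): a region of pixels is kept as one coarse token, or recursively split into quadrants, depending on which choice yields the lower Bayesian Information Criterion under a simple diagonal-Gaussian model of the region's pixel values. All function names and the Gaussian model are illustrative assumptions.

```python
import numpy as np

def gaussian_bic(x):
    """BIC of a diagonal-Gaussian MLE fit to pixels x of shape (n, d).

    Illustrative model choice, not from the paper.
    """
    n, d = x.shape
    var = x.var(axis=0) + 1e-6          # per-channel variance (regularized)
    # log-likelihood of a diagonal Gaussian evaluated at its MLE
    ll = -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)
    k = 2 * d                            # parameters: mean + variance per channel
    return k * np.log(n) - 2 * ll

def tokenize(region, min_size=4):
    """Recursively split an (H, W, C) region while splitting lowers total BIC."""
    h, w, _ = region.shape
    flat = region.reshape(-1, region.shape[-1]).astype(np.float64)
    if min(h, w) < 2 * min_size:
        return [region]                  # too small to split further
    quads = [region[:h//2, :w//2], region[:h//2, w//2:],
             region[h//2:, :w//2], region[h//2:, w//2:]]
    split_bic = sum(gaussian_bic(q.reshape(-1, q.shape[-1]).astype(np.float64))
                    for q in quads)
    if split_bic < gaussian_bic(flat):   # finer tokens justified by the data
        return [t for q in quads for t in tokenize(q, min_size)]
    return [region]                      # keep one coarse token

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 3))
img[:16, :16] += 5.0                     # a distinct region forces a split
tokens = tokenize(img)
print(len(tokens))                       # structured input yields multiple tokens
```

The key property this toy version shares with the paper's framing is that token granularity is chosen per region by comparing model fit against model complexity, so homogeneous areas stay coarse while structured areas are refined; the paper's contribution is making such a selection end-to-end differentiable.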