TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction

2024-05-27
arXiv.org
Citations: 12
Influential: 0
AI Summary
Conventional vision models struggle with accurate neuron segmentation in electron microscopy (EM) images, which are characterized by high noise, anisotropy, and ultra-long-range spatial dependencies. Method: We propose TokenUnify, a hybrid token prediction framework that unifies random token prediction, next-token prediction, and next-all token prediction, and we provide theoretical evidence that this mixture mitigates error accumulation in autoregressive pretraining. TokenUnify pairs the Mamba architecture with a large-scale EM image serialization scheme so that ultra-long spatial sequences can be modeled efficiently. Contribution/Results: We assemble an ultra-high-resolution EM dataset with over 120 million annotated voxels, the largest neuron segmentation dataset to date. On downstream EM neuron segmentation tasks, TokenUnify achieves a 45% improvement in segmentation performance over existing methods while reducing computational complexity. It also scales better than masked autoencoders (MAE) and traditional autoregressive approaches, narrowing the gap between pretraining strategies for language and vision models.
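The summary mentions serializing EM volumes into long sequences but does not say how. As a minimal sketch only, the snippet below enumerates non-overlapping 3D patches in raster order (depth, then rows, then columns). The function name `raster_patch_order`, the raster ordering, and the patch sizes are all assumptions made for illustration; they are not taken from the paper.

```python
def raster_patch_order(shape, patch):
    """Return patch origins (z, y, x) in raster order.

    Hypothetical helper: the ordering actually used by TokenUnify is not
    described on this page, so raster order is an assumption.
    """
    depth, height, width = shape
    pd, ph, pw = patch
    if depth % pd or height % ph or width % pw:
        raise ValueError("volume shape must be divisible by the patch size")
    # Depth varies slowest and width fastest, as in a row-major scan.
    return [
        (z, y, x)
        for z in range(0, depth, pd)
        for y in range(0, height, ph)
        for x in range(0, width, pw)
    ]


# A thin patch along z is one way to reflect anisotropic voxels.
order = raster_patch_order((2, 4, 4), (1, 2, 2))
print(len(order), order[:3])
```

Each origin corresponds to one token, so the sequence length grows with the number of patches in the volume, which is why long-sequence modeling matters here.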

Abstract
Autoregressive next-token prediction is a standard pretraining method for large-scale language models, but its application to vision tasks is hindered by the non-sequential nature of image data, leading to cumulative errors. Most vision models employ masked autoencoder (MAE) based pretraining, which faces scalability issues. To address these challenges, we introduce TokenUnify, a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction. We provide theoretical evidence demonstrating that TokenUnify mitigates cumulative errors in visual autoregression. To accompany TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution, ideal for creating spatially correlated long sequences. This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date and providing a unified benchmark for experimental validation. Leveraging the Mamba network, which is inherently suited to long-sequence modeling, on this dataset, TokenUnify not only reduces computational complexity but also delivers a 45% improvement in segmentation performance on downstream EM neuron segmentation tasks compared to existing methods. Furthermore, TokenUnify demonstrates superior scalability over MAE and traditional autoregressive methods, effectively bridging the gap between pretraining strategies for language and vision models. Code is available at https://github.com/ydchen0806/TokenUnify.
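The abstract names three prediction objectives without defining them. The sketch below shows one plausible reading of each name as an (input, target) construction over a token sequence. All function names, the masking ratio, and the exact target definitions are assumptions for illustration; the paper's loss functions and mixing schedule are not given here and are not reproduced.

```python
import random


def random_token_targets(tokens, mask_ratio, rng):
    """Random token prediction (assumed reading): hide a random subset of
    positions and predict the hidden tokens from the visible ones."""
    n = len(tokens)
    k = max(1, round(mask_ratio * n))
    masked = set(rng.sample(range(n), k))
    visible = [None if i in masked else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(masked)}
    return visible, targets


def next_token_targets(tokens):
    """Next-token prediction: the prefix tokens[:i+1] predicts tokens[i+1]."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]


def next_all_token_targets(tokens):
    """Next-all token prediction (assumed reading): the prefix tokens[:i+1]
    predicts every remaining token tokens[i+1:]."""
    return [(tokens[: i + 1], tokens[i + 1:]) for i in range(len(tokens) - 1)]


tokens = ["p0", "p1", "p2", "p3"]  # placeholder patch tokens
print(random_token_targets(tokens, 0.5, random.Random(0)))
print(next_token_targets(tokens)[0])      # (['p0'], 'p1')
print(next_all_token_targets(tokens)[0])  # (['p0'], ['p1', 'p2', 'p3'])
```

Under this reading, next-all token prediction gives each prefix a longer-horizon target than next-token prediction, which is one way a mixture could counter errors that compound step by step.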
Problem

Research questions and friction points this paper is trying to address.

Neuron segmentation from EM volumes faces complex structural challenges
High noise and anisotropic voxels require specialized vision models
Autoregressive pretraining adapts language model strategies for microscopy data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical predictive coding framework
Three complementary learning objectives
Linear-time Mamba architecture modeling
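To illustrate what "linear-time" means in the last item, here is a minimal sketch of a scalar linear recurrence processed in a single pass over the sequence. This is not Mamba: it omits the input-dependent (selective) parameters and everything else specific to that architecture. The function name `linear_scan` and the coefficients `a`, `b`, `c` are hypothetical.

```python
def linear_scan(xs, a=0.9, b=0.1, c=1.0):
    """Run h_t = a * h_{t-1} + b * x_t and emit y_t = c * h_t.

    One update per input element, so the cost is O(L) in sequence length L,
    in contrast to the O(L^2) pairwise interactions of full self-attention.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # state carries a summary of all earlier inputs
        ys.append(c * h)
    return ys


print(linear_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))  # [1.0, 0.5, 0.25]
```

The fixed-size state is what keeps memory and compute from growing quadratically as the serialized EM sequence gets longer.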
Authors

Yinda Chen
University of Science and Technology of China, Xiamen University
Machine Learning Theory, Self-supervised Learning, Image Compression

Haoyuan Shi
University of Science and Technology of China

Xiaoyu Liu
University of Science and Technology of China

Te Shi
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

Ruobing Zhang
Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences

Dong Liu
University of Science and Technology of China

Zhiwei Xiong
University of Science and Technology of China
Computational Photography, Biomedical Image Analysis

Feng Wu
National University of Singapore
Machine Learning, Medical Time Series