🤖 AI Summary
This work addresses the challenge of identifying and efficiently utilizing salient image patches in image modeling. We propose a novel “patch collapse” paradigm, inspired by quantum state collapse, wherein reducing patch uncertainty reveals an optimal perceptual ordering through inter-patch dependency modeling. Methodologically, we construct a dependency graph over patches and jointly leverage graph neural networks and PageRank to quantify patch importance; a soft selection and reconstruction mechanism is implemented via an autoencoder. Our key contributions are: (i) the first introduction of a quantum-inspired collapse perspective into patch-based vision modeling, and (ii) a structured, dependency-driven dynamic patch ranking mechanism. Experiments show that following the collapse order improves masked autoregressive (MAR) image generation. Moreover, a Vision Transformer that sees only the top 22% of patches in the collapse order still achieves high image classification accuracy, which supports the method's dual advantages in modeling efficiency and representational compactness.
📝 Abstract
Observing certain patches in an image reduces the uncertainty of others. Their realization lowers the distribution entropy of each remaining patch feature, analogous to collapsing a particle's wave function in quantum mechanics. This phenomenon can intuitively be called patch collapse. To identify which patches are most relied on during a target region's collapse, we learn an autoencoder that softly selects a subset of patches to reconstruct each target patch. Building a graph from these learned dependencies and computing each patch's PageRank score reveals the optimal patch order in which to realize an image. We show that respecting this order benefits various masked image modeling methods. First, autoregressive image generation can be boosted by retraining the state-of-the-art model MAR. Next, we introduce a new setup for image classification by exposing Vision Transformers only to high-rank patches in the collapse order. Seeing 22% of such patches is sufficient to achieve high accuracy. With these experiments, we propose patch collapse as a novel image modeling perspective that promotes vision efficiency. Our project is available at https://github.com/wguo-ai/CoP.
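
The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the project's implementation: the matrix `dep`, the function names `pagerank_order` and `top_fraction`, the damping factor of 0.85, and the toy 4-patch example are all assumptions made for illustration. The sketch assumes `dep[i, j]` holds a soft-selection weight describing how much target patch `i` relies on source patch `j`, and it runs a plain power-iteration PageRank so that patches relied on by many others receive high scores.

```python
import numpy as np


def pagerank_order(dep, damping=0.85, tol=1e-10, max_iter=1000):
    """Score patches with PageRank over a dependency matrix.

    dep[i, j] >= 0 is a hypothetical soft-selection weight: how much
    target patch i relies on source patch j during reconstruction.
    Returns (scores, order), where order lists patch indices from the
    highest score to the lowest.
    """
    dep = np.asarray(dep, dtype=float).copy()
    n = dep.shape[0]
    np.fill_diagonal(dep, 0.0)  # ignore self-dependence

    # Row-normalise so each target patch distributes one unit of mass
    # to the patches it relies on. Rows with no dependencies spread
    # their mass uniformly (the usual dangling-node treatment).
    row_sums = dep.sum(axis=1, keepdims=True)
    has_deps = row_sums > 0
    trans = np.where(has_deps, dep / np.where(has_deps, row_sums, 1.0), 1.0 / n)

    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Mass flows from each target patch to the patches it depends on.
        new = (1.0 - damping) / n + damping * (scores @ trans)
        converged = np.abs(new - scores).sum() < tol
        scores = new
        if converged:
            break

    order = np.argsort(-scores, kind="stable")
    return scores, order


def top_fraction(order, frac):
    """Keep the first ceil(frac * n) patches of the ranking."""
    k = max(1, int(np.ceil(frac * len(order))))
    return order[:k]


# Toy example (assumed values): patches 1 and 2 rely only on patch 0,
# patch 3 relies equally on patches 0 and 1, patch 0 relies on nothing.
dep = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
])
scores, order = pagerank_order(dep)
kept = top_fraction(order, 0.22)  # 22% as in the abstract; here a single patch
```

In this toy graph patch 0 is ranked first because every other patch depends on it. A classifier exposed only to `kept` would see the highest-ranked patches, which mirrors the classification setup in the abstract under the stated assumptions.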