🤖 AI Summary
Vision Transformers suffer from spatial-channel feature entanglement in multi-channel vision tasks, which hinders independent modeling of structural and semantic dependencies—particularly detrimental in channel-sensitive domains such as hyperspectral remote sensing and infrared histopathological imaging. To address this, we propose DisentangleFormer: the first Vision Transformer architecture to adopt parallel spatial and channel token streams, motivated by information-theoretic principles, for dual-path disentangled representation learning. It introduces a Squeezed Token Enhancer to boost channel-wise discriminability and integrates multi-scale feed-forward networks with hybrid local-global attention for synergistic modeling. Evaluated on multiple hyperspectral and infrared histopathology benchmarks, DisentangleFormer achieves state-of-the-art performance. On ImageNet, it reduces computational cost by 17.8% while maintaining competitive accuracy. The framework establishes an interpretable, efficient, and general-purpose disentanglement paradigm for multi-channel vision Transformers.
📝 Abstract
Vision Transformers face a fundamental limitation: standard self-attention jointly processes spatial and channel dimensions, leading to entangled representations that prevent independent modeling of structural and semantic dependencies. This problem is especially pronounced in multi-channel imaging, from satellite hyperspectral remote sensing to infrared pathology imaging, where channels capture distinct biophysical or biochemical cues. We propose DisentangleFormer, an architecture that achieves robust multi-channel vision representation through principled spatial-channel decoupling. Motivated by information-theoretic principles of decorrelated representation learning, our parallel design enables independent modeling of structural and semantic cues while minimizing redundancy between spatial and channel streams. Our design integrates three core components: (1) Parallel Disentanglement: independently processes spatial-token and channel-token streams, enabling decorrelated feature learning across spatial and spectral dimensions; (2) Squeezed Token Enhancer: an adaptive calibration module that dynamically fuses the spatial and channel streams; and (3) Multi-Scale FFN: complements global attention with multi-scale local context to capture fine-grained structural and semantic dependencies. Extensive experiments on hyperspectral benchmarks demonstrate that DisentangleFormer achieves state-of-the-art performance, consistently outperforming existing models on Indian Pines, Pavia University, and Houston, the large-scale BigEarthNet remote sensing dataset, as well as an infrared pathology dataset. Moreover, it retains competitive accuracy on ImageNet while reducing computational cost by 17.8% in FLOPs. The code will be made publicly available upon acceptance.
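To make the parallel spatial/channel token idea concrete, the following is a minimal NumPy sketch: spatial tokens are the H·W pixel positions (each a spectral vector), channel tokens are the C bands (each a flattened spatial map), and each stream runs its own self-attention before a simple gated fusion. All shapes, weight initializations, and the sigmoid-gate fusion are illustrative assumptions standing in for the paper's Squeezed Token Enhancer, not the actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention over a token set
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return scores @ V

rng = np.random.default_rng(0)
H, W, C, d = 8, 8, 16, 32                 # toy hyperspectral cube, embed dim d
cube = rng.standard_normal((H, W, C))

# Spatial stream: H*W tokens, each a C-dim spectral vector
spatial_tokens = cube.reshape(H * W, C)
Ws = [rng.standard_normal((C, d)) * 0.1 for _ in range(3)]
spatial_out = self_attention(spatial_tokens, *Ws)   # (H*W, d)

# Channel stream: C tokens, each an H*W-dim spatial map
channel_tokens = spatial_tokens.T                   # (C, H*W)
Wc = [rng.standard_normal((H * W, d)) * 0.1 for _ in range(3)]
channel_out = self_attention(channel_tokens, *Wc)   # (C, d)

# Toy fusion (hypothetical stand-in for the Squeezed Token Enhancer):
# gate the spatial stream with a channel-derived calibration vector
gate = 1.0 / (1.0 + np.exp(-channel_out.mean(axis=0)))  # (d,)
fused = spatial_out * gate                              # (H*W, d)
```

The key point the sketch illustrates is that the two attention operations never mix axes: the spatial stream attends across positions, the channel stream across bands, and only the lightweight fusion step couples them.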