🤖 AI Summary
This work addresses the challenge of heterogeneous channel configurations in multi-channel imaging. Such heterogeneity, which arises from variations in staining protocols, sensors, and acquisition settings, hinders the generalization of fixed-channel encoders and dilutes channel-specific semantics in existing cross-channel interaction methods. To overcome these limitations, we propose the Decoupled Vision Transformer (DC-ViT), a multi-channel ViT framework that explicitly decouples spatial and channel interactions. Our approach introduces Decoupled Self-Attention (DSA), which separates token updates into a spatial path that models intra-channel structure and a channel path that adaptively fuses cross-channel information, complemented by Decoupled Aggregation (DAG), which learns task-dependent channel importance. Experiments on three multi-channel benchmarks demonstrate that DC-ViT consistently outperforms existing MC-ViT methods, improving robustness and representational capacity on heterogeneous channel inputs.
📝 Abstract
Training and evaluation in multi-channel imaging (MCI) remain challenging due to heterogeneous channel configurations arising from varying staining protocols, sensor types, and acquisition settings. This heterogeneity limits the applicability of fixed-channel encoders commonly used in general computer vision. Recent Multi-Channel Vision Transformers (MC-ViTs) address this by enabling flexible channel inputs, typically by jointly encoding patch tokens from all channels within a unified attention space. However, unrestricted token interactions across channels can lead to feature dilution, reducing the ability to preserve channel-specific semantics that are critical in MCI data. To address this, we propose the Decoupled Vision Transformer (DC-ViT), which explicitly regulates information sharing via Decoupled Self-Attention (DSA). DSA decomposes token updates into two complementary pathways: spatial updates that model intra-channel structure, and channel-wise updates that adaptively integrate cross-channel information. This decoupling mitigates informational collapse while allowing selective inter-channel interaction. To further exploit these enhanced channel-specific representations, we introduce Decoupled Aggregation (DAG), which allows the model to learn task-specific channel importance. Extensive experiments across three MCI benchmarks demonstrate consistent improvements over existing MC-ViT approaches.
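To make the decoupling concrete, the two DSA pathways and the DAG step described above can be sketched in plain NumPy. This is a minimal illustration under assumed simplifications (single-head attention, shared query/key/value projections across both pathways, additive residual fusion, and mean-pooling before channel scoring); the paper's actual layer design, projection structure, and fusion scheme may differ. All function and weight names here (`decoupled_self_attention`, `decoupled_aggregation`, `w_score`, etc.) are illustrative, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last axis.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def decoupled_self_attention(x, w_q, w_k, w_v):
    """x: (C, N, D) tokens -- C channels, N patch positions, D dims."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Spatial path: attention among the N patch tokens within each channel.
    spatial = attention(q, k, v)                             # (C, N, D)
    # Channel path: attention among the C channels at each patch position.
    qc, kc, vc = (np.swapaxes(t, 0, 1) for t in (q, k, v))   # (N, C, D)
    channel = np.swapaxes(attention(qc, kc, vc), 0, 1)       # (C, N, D)
    # Additive residual fusion of the two pathways (illustrative choice).
    return x + spatial + channel

def decoupled_aggregation(x, w_score):
    """Pool patches per channel, then weight channels by learned scores."""
    pooled = x.mean(axis=1)                       # (C, D) per-channel summary
    weights = softmax(pooled @ w_score, axis=0)   # (C, 1) channel importance
    return (weights * pooled).sum(axis=0)         # (D,) task-level feature

rng = np.random.default_rng(0)
C, N, D = 3, 16, 8
x = rng.standard_normal((C, N, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
w_score = rng.standard_normal((D, 1)) * 0.1
y = decoupled_self_attention(x, w_q, w_k, w_v)
z = decoupled_aggregation(y, w_score)
print(y.shape, z.shape)  # (3, 16, 8) (8,)
```

The key point the sketch captures is that each token attends over patches and over channels in two separate, restricted attention maps, rather than over all C×N tokens jointly, which is how the decoupling limits cross-channel feature dilution while still permitting inter-channel exchange.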