🤖 AI Summary
Existing token merging methods, when applied off the shelf to the Segment Anything Model (SAM), often degrade boundary details and leak prompt information, and thus struggle to balance efficiency and accuracy. The authors propose StructSAM, a structure- and spectrum-preserving token merging-and-recovery framework tailored to SAM that achieves substantial computational savings without retraining. Its key innovations are an energy scoring mechanism based on first-order feature gradients, combined with a grid flatness criterion that safeguards boundaries and prompt-sensitive regions, and, building on spectral graph coarsening theory, the first analysis of how token merging affects feature structure, with a guarantee of bounded spectral distortion. Evaluated across eight natural and medical image benchmarks, StructSAM reduces encoder FLOPs by 25–30% (over 40% in prompt-aware settings) with only marginal drops in mIoU/Dice, significantly outperforming methods such as ToMe and PiToMe.
📝 Abstract
Recent token merging techniques for Vision Transformers (ViTs) provide substantial speedups by reducing the number of tokens processed by self-attention, often without retraining. However, their direct application to the Segment Anything Model (SAM) family is nontrivial: SAM's image encoder mixes windowed and global attention, and its mask decoder relies on dense, prompt-conditioned features for precise boundary prediction. We systematically evaluate representative token-merging methods on SAM and Medical SAM in a strict off-the-shelf setting, and find that existing destination-selection heuristics can erode boundaries and leak prompt information as merge rates increase. We propose **StructSAM**, a resolution-preserving merge-unmerge framework tailored to SAM. StructSAM computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening to protect boundary and prompt regions, and merges tokens within flat areas toward low-energy destinations with explicit token recovery. We further provide a spectral graph coarsening view showing that score-guided merging yields bounded Laplacian spectral distortion compared to random or window-restricted baselines. Across eight natural and medical benchmarks, StructSAM reduces encoder FLOPs by 25–30% (over 40% with prompt-aware merging) with minor drops in mIoU/Dice, consistently outperforming ToMe, PiToMe, ToMeSD, VidToMe, and ALGM at the same compute.
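The abstract's pipeline (gradient-based energy scoring, grid flatness screening, merging within flat cells, and later token recovery) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name, the `tau` flatness threshold, the cell-mean destination, and the `grid` size are all assumptions chosen for clarity.

```python
import numpy as np

def merge_unmerge_sketch(feats, h, w, grid=2, tau=1.0):
    """Illustrative sketch of score-guided token merging with recovery.

    feats: (h*w, d) token features laid out on an h-by-w grid.
    Tokens in 'flat' grid cells (low first-order feature gradients,
    proxying the paper's energy score) are merged into one destination;
    tokens in high-energy cells (likely boundaries / prompt regions)
    are kept. The assignment map lets us 'unmerge' back to full size.
    All names and thresholds here are hypothetical.
    """
    d = feats.shape[1]
    x = feats.reshape(h, w, d)

    # Token energy: L1 norm of first-order feature differences
    # along both grid axes (a cheap stand-in for feature gradients).
    gy = np.abs(np.diff(x, axis=0, prepend=x[:1])).sum(-1)
    gx = np.abs(np.diff(x, axis=1, prepend=x[:, :1])).sum(-1)
    energy = gy + gx                       # (h, w)

    merged = []                            # reduced token set
    assign = np.empty((h, w), dtype=int)   # token -> merged-slot map
    for i in range(0, h, grid):
        for j in range(0, w, grid):
            cell = x[i:i + grid, j:j + grid]
            e = energy[i:i + grid, j:j + grid]
            if e.max() < tau:
                # Flat cell: collapse all its tokens into one destination.
                assign[i:i + grid, j:j + grid] = len(merged)
                merged.append(cell.reshape(-1, d).mean(0))
            else:
                # High-energy cell: preserve every token unchanged.
                for a in range(cell.shape[0]):
                    for b in range(cell.shape[1]):
                        assign[i + a, j + b] = len(merged)
                        merged.append(cell[a, b])

    merged = np.stack(merged)                   # tokens fed to attention
    recovered = merged[assign.reshape(-1)]      # unmerge: broadcast back
    return merged, recovered
```

Running attention on `merged` instead of `feats` is where the FLOP savings come from; `recovered` restores the full resolution the mask decoder expects, at the cost of duplicated features inside merged flat regions.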