🤖 AI Summary
Existing token merging methods, when applied off the shelf to the Segment Anything Model (SAM), often degrade boundary details and leak prompt information, and thus struggle to balance efficiency and accuracy. The authors propose StructSAM, a structure- and spectrum-preserving token merging-and-recovery framework tailored to SAM that achieves substantial computational savings without retraining. Its key innovations are an energy scoring mechanism based on first-order feature gradients, combined with a grid flatness criterion that safeguards boundaries and prompt-sensitive regions, and, building on spectral graph coarsening theory, the first analysis of how token merging affects feature structure, with a guarantee of bounded spectral distortion. Evaluated across eight natural and medical image benchmarks, StructSAM reduces encoder FLOPs by 25–30% (over 40% in prompt-aware settings) with only marginal drops in mIoU/Dice, significantly outperforming methods such as ToMe and PiToMe.
📝 Abstract
Recent token merging techniques for Vision Transformers (ViTs) provide substantial speedups by reducing the number of tokens processed by self-attention, often without retraining. However, their direct application to the Segment Anything Model (SAM) family is nontrivial: SAM's image encoder mixes windowed and global attention, and its mask decoder relies on dense, prompt-conditioned features for precise boundary prediction. We systematically evaluate representative token-merging methods on SAM and Medical SAM in a strict off-the-shelf setting, and find that existing destination-selection heuristics can erode boundaries and leak prompt information as merge rates increase. We propose **StructSAM**, a resolution-preserving merge-unmerge framework tailored to SAM. StructSAM computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening to protect boundary and prompt regions, and merges tokens within flat areas toward low-energy destinations with explicit token recovery. We further provide a spectral graph coarsening view showing that score-guided merging yields bounded Laplacian spectral distortion compared to random or window-restricted baselines. Across eight natural and medical benchmarks, StructSAM reduces encoder FLOPs by 25–30% (over 40% with prompt-aware merging) with minor drops in mIoU/Dice, consistently outperforming ToMe, PiToMe, ToMeSD, VidToMe, and ALGM at the same compute.
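The abstract's pipeline (gradient-based energy scoring, grid flatness screening, merging within flat cells, and later token recovery) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name, the `tau` flatness threshold, the cell-mean destination, and the `grid` size are all assumptions chosen for clarity.

```python
import numpy as np

def merge_unmerge_sketch(feats, h, w, grid=2, tau=1.0):
    """Illustrative sketch of score-guided token merging with recovery.

    feats: (h*w, d) token features laid out on an h-by-w grid.
    Tokens in 'flat' grid cells (low first-order feature gradients,
    proxying the paper's energy score) are merged into one destination;
    tokens in high-energy cells (likely boundaries / prompt regions)
    are kept. The assignment map lets us 'unmerge' back to full size.
    All names and thresholds here are hypothetical.
    """
    d = feats.shape[1]
    x = feats.reshape(h, w, d)

    # Token energy: L1 norm of first-order feature differences
    # along both grid axes (a cheap stand-in for feature gradients).
    gy = np.abs(np.diff(x, axis=0, prepend=x[:1])).sum(-1)
    gx = np.abs(np.diff(x, axis=1, prepend=x[:, :1])).sum(-1)
    energy = gy + gx                       # (h, w)

    merged = []                            # reduced token set
    assign = np.empty((h, w), dtype=int)   # token -> merged-slot map
    for i in range(0, h, grid):
        for j in range(0, w, grid):
            cell = x[i:i + grid, j:j + grid]
            e = energy[i:i + grid, j:j + grid]
            if e.max() < tau:
                # Flat cell: collapse all its tokens into one destination.
                assign[i:i + grid, j:j + grid] = len(merged)
                merged.append(cell.reshape(-1, d).mean(0))
            else:
                # High-energy cell: preserve every token unchanged.
                for a in range(cell.shape[0]):
                    for b in range(cell.shape[1]):
                        assign[i + a, j + b] = len(merged)
                        merged.append(cell[a, b])

    merged = np.stack(merged)                   # tokens fed to attention
    recovered = merged[assign.reshape(-1)]      # unmerge: broadcast back
    return merged, recovered
```

Running attention on `merged` instead of `feats` is where the FLOP savings come from; `recovered` restores the full resolution the mask decoder expects, at the cost of duplicated features inside merged flat regions.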