🤖 AI Summary
Segment Anything Model (SAM) suffers from inaccurate boundary localization in fine-grained endoscopic instance segmentation due to its reliance on sparse prompts (e.g., points or bounding boxes), which inadequately encode object shape priors.
Method: We propose an extreme-point interaction paradigm, using the leftmost, rightmost, topmost, and bottommost points of the target as structured prompts in place of conventional sparse prompts. We design an extreme-point semantic embedding mechanism and a prompt-only Canvas auxiliary training task to explicitly model the mapping between extreme-point spatial configurations and the corresponding mask distributions. Within the SAM architecture, we introduce learnable extreme-point embeddings, a Canvas prompt encoder, and a vision-free mask-prior prediction module.
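During training (or when simulating annotators), the four extreme-point prompts can be derived directly from a ground-truth instance mask. A minimal sketch of that extraction, assuming a binary NumPy mask and (row, col) coordinates (helper name and conventions are illustrative, not from the paper):

```python
import numpy as np

def extreme_points(mask: np.ndarray) -> dict:
    """Return (row, col) of the top-, bottom-, left-, and right-most
    foreground pixels of a binary instance mask (illustrative helper)."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        raise ValueError("mask has no foreground pixels")
    return {
        "top":    (rows[np.argmin(rows)], cols[np.argmin(rows)]),
        "bottom": (rows[np.argmax(rows)], cols[np.argmax(rows)]),
        "left":   (rows[np.argmin(cols)], cols[np.argmin(cols)]),
        "right":  (rows[np.argmax(cols)], cols[np.argmax(cols)]),
    }
```

Ties (several pixels sharing the extreme coordinate) are resolved arbitrarily here; a practical pipeline might also jitter the points to mimic human click noise.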
Results: Our method achieves state-of-the-art performance across three endoscopic surgery datasets, outperforming prior SAM adaptation approaches. A human-factor study demonstrates a 37% improvement in annotation efficiency over bounding-box prompting, alongside higher segmentation accuracy.
📝 Abstract
The Segment Anything Model (SAM) has revolutionized open-set interactive image segmentation, inspiring numerous adapters for the medical domain. However, SAM primarily relies on sparse prompts such as points or bounding boxes, which may be suboptimal for fine-grained instance segmentation, particularly in endoscopic imagery, where precise localization is critical and existing prompts struggle to capture object boundaries effectively. To address this, we introduce S4M (Segment Anything with 4 Extreme Points), which augments SAM with extreme-point prompts -- the top-, bottom-, left-, and right-most points of an instance. These points are intuitive to identify and provide a faster, structured alternative to box prompts. However, naïve use of extreme points degrades performance, because SAM cannot interpret their semantic roles. To resolve this, we introduce dedicated learnable embeddings, enabling the model to distinguish extreme points from generic free-form points and to reason better about their spatial relationships. We further propose an auxiliary training task through the Canvas module, which operates solely on prompts -- without vision input -- to predict a coarse instance mask. This encourages the model to internalize the relationship between extreme points and mask distributions, leading to more robust segmentation. S4M outperforms other SAM-based approaches on three endoscopic surgical datasets, demonstrating its effectiveness in complex scenarios. Finally, we validate our approach through a human annotation study on surgical endoscopic videos, confirming that extreme points are faster to acquire than bounding boxes.
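The "dedicated learnable embeddings" idea can be pictured as giving each extreme-point role its own label vector that is added to the point's positional encoding, instead of SAM's single generic foreground-point label. A minimal NumPy sketch under assumed shapes (the embedding dimension `D = 256` and the additive combination are assumptions, not details from the abstract):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256  # assumed prompt-embedding dimension, matching SAM's prompt encoder

# One learnable vector per extreme-point role (randomly initialized here;
# in training these would be optimized parameters).
ROLE_EMBED = {role: rng.normal(size=D) for role in ("top", "bottom", "left", "right")}

def encode_extreme_point(pos_enc: np.ndarray, role: str) -> np.ndarray:
    """Point token = positional encoding of the click + role-specific embedding,
    so the model can tell a 'left' click from a 'top' click (illustrative)."""
    return pos_enc + ROLE_EMBED[role]
```

With generic point prompts all four clicks would receive the same label embedding; here the role-specific vectors let downstream attention layers reason about which side of the object each point bounds.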