🤖 AI Summary
In laparoscopic liver surgery, conventional 2D video impedes depth perception and compromises anatomical landmark localization accuracy. To address this, we propose a depth-guided anatomical landmark segmentation framework. Our method employs a dual-encoder architecture: SAM2 processes RGB inputs while DA2 extracts depth features; a cross-attention fusion module integrates semantic and geometric cues; and SRFT-GaLore, a low-rank optimization technique that replaces SVD with a subsampled randomized Fourier transform, significantly reduces the fine-tuning overhead of large foundation models. Evaluated on the L3D dataset, our approach achieves a 4.85% Dice score improvement and an 11.78-point reduction in average symmetric surface distance. On the newly introduced LLSD dataset, it substantially outperforms SAM-based baselines, demonstrating strong cross-domain generalization and suitability for real-time intraoperative use.
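The summary does not spell out the fusion mechanism, so here is a minimal NumPy sketch of single-head cross-attention in which RGB tokens query depth tokens. The learned Q/K/V projections, multi-head structure, and normalization layers of the actual module are omitted, and all variable names and token shapes are hypothetical:

```python
import numpy as np

def cross_attention_fuse(rgb_tokens, depth_tokens):
    """Single-head cross-attention: RGB tokens act as queries over depth
    tokens, so each semantic token gathers geometric context.
    (The real module would apply learned Q/K/V projections first.)"""
    d = rgb_tokens.shape[-1]
    scores = rgb_tokens @ depth_tokens.T / np.sqrt(d)   # (N_rgb, N_depth)
    scores -= scores.max(axis=1, keepdims=True)         # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)             # attention weights per RGB token
    fused = attn @ depth_tokens                         # depth context gathered per token
    return rgb_tokens + fused                           # residual keeps the RGB semantics

rng = np.random.default_rng(0)
rgb = rng.standard_normal((196, 64))     # hypothetical SAM2 feature tokens
depth = rng.standard_normal((196, 64))   # hypothetical DA2 feature tokens
out = cross_attention_fuse(rgb, depth)
```

The residual connection reflects the usual design choice of letting depth cues refine, rather than replace, the semantic RGB features.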
📝 Abstract
Accurate detection and delineation of anatomical structures in medical imaging are critical for computer-assisted interventions, particularly in laparoscopic liver surgery, where 2D video streams limit depth perception and complicate landmark localization. While recent works have leveraged monocular depth cues for enhanced landmark detection, challenges remain in fusing RGB and depth features and in efficiently adapting large-scale vision models to surgical domains. We propose a depth-guided liver landmark segmentation framework that integrates semantic and geometric cues via vision foundation encoders. We employ the Segment Anything Model V2 (SAM2) encoder to extract RGB features and the Depth Anything V2 (DA2) encoder to extract depth-aware features. To efficiently adapt SAM2, we introduce SRFT-GaLore, a novel low-rank gradient projection method that replaces the computationally expensive SVD with a Subsampled Randomized Fourier Transform (SRFT). This enables efficient fine-tuning of high-dimensional attention layers without sacrificing representational power. A cross-attention fusion module further integrates RGB and depth cues. To assess cross-dataset generalization, we also construct a new Laparoscopic Liver Surgical Dataset (LLSD) as an external validation benchmark. On the public L3D dataset, our method achieves a 4.85% improvement in Dice Similarity Coefficient and an 11.78-point reduction in Average Symmetric Surface Distance compared to D2GPLand. To further assess generalization capability, we evaluate our model on the LLSD dataset, where it maintains competitive performance and significantly outperforms SAM-based baselines, demonstrating strong cross-dataset robustness and adaptability to unseen surgical environments. These results show that our SRFT-GaLore-enhanced dual-encoder framework enables scalable and precise segmentation under real-time, depth-constrained surgical settings.
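The abstract describes SRFT-GaLore as swapping the SVD in GaLore-style low-rank gradient projection for an SRFT. A minimal NumPy sketch of that idea, using a standard randomized range finder (random sign flips, FFT mixing, column subsampling, then QR), is shown below. Shapes, the rank, and the use of the real part of the FFT output are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def srft_projection_basis(G, r, rng):
    """Build a rank-r projection basis for gradient matrix G (m x n) via a
    Subsampled Randomized Fourier Transform instead of an SVD.
    Sketch = G @ (D F S): sign flip (D), FFT mixing (F), subsampling (S)."""
    m, n = G.shape
    signs = rng.choice([-1.0, 1.0], size=n)         # random diagonal D
    mixed = np.fft.fft(G * signs, axis=1)           # FFT mixes information across columns
    cols = rng.choice(n, size=r, replace=False)     # subsample r mixed columns
    sketch = np.real(mixed[:, cols]) * np.sqrt(n / r)  # real part kept for simplicity
    P, _ = np.linalg.qr(sketch)                     # orthonormal basis of the sketch range
    return P                                        # (m, r)

rng = np.random.default_rng(0)
G = rng.standard_normal((256, 512))   # hypothetical attention-layer gradient
P = srft_projection_basis(G, r=16, rng=rng)
G_low = P.T @ G                       # (r, n) low-rank gradient for the optimizer state
G_back = P @ G_low                    # project back to full space after the update step
```

The FFT-based sketch costs O(mn log n) versus the O(mn^2) of a full SVD, which is the source of the fine-tuning savings the abstract claims.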