🤖 AI Summary
To address the bottleneck that scarce annotated surgical image data poses for vision-based automation, this paper proposes a synthetic data generation method based on dynamic, deformable 3D Gaussian reconstruction. The approach introduces a dynamic Gaussian model that jointly captures the articulated motion of surgical instruments and the non-rigid deformation of soft tissue, while incorporating ex-vivo tissue backgrounds to enhance anatomical realism. To ensure geometric fidelity, the method integrates camera pose adaptation and an optimized training strategy for improved rendering quality and multi-view consistency, and it automatically generates annotations for the synthesized images. Evaluated on a new dataset of seven scenes totaling 14,000 frames of real surgical video, the method achieves a PSNR of 29.87. When used to train downstream medical models, the synthesized data yields a 10% performance gain over standard data augmentation and an overall improvement of roughly 15% in key evaluation metrics.
📝 Abstract
Computer vision-based technologies significantly enhance surgical automation by advancing tool tracking, detection, and localization. However, current data-driven approaches are data-hungry, requiring large, high-quality labeled image datasets, which limits their application in surgical data science. Our work introduces a novel dynamic Gaussian Splatting technique to address the scarcity of annotated surgical image datasets. We propose a dynamic Gaussian model to represent dynamic surgical scenes, enabling the rendering of surgical instruments from unseen viewpoints and deformations with real tissue backgrounds. We employ a dynamic training adjustment strategy to address the challenges posed by poorly calibrated camera poses in real-world scenarios. Additionally, we propose a method based on dynamic Gaussians for automatically generating annotations for our synthetic data. For evaluation, we constructed a new dataset of seven scenes comprising 14,000 frames of tool and camera motion and tool jaw articulation, captured against the background of an ex-vivo porcine model. Using this dataset, we synthetically replicate the scene deformation from the ground-truth data, allowing direct comparison of synthetic image quality. Experimental results show that our method generates photo-realistic labeled image datasets, achieving the highest peak signal-to-noise ratio (29.87). We further evaluate medical-specific neural networks trained on real and synthetic images using an unseen real-world image dataset. Models trained on synthetic images generated by the proposed method outperform those trained with state-of-the-art standard data augmentation by 10%, improving overall model performance by nearly 15%.
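For readers unfamiliar with the image-quality metric reported above, the following is a minimal sketch of how PSNR between a ground-truth frame and a rendered frame is typically computed (the `psnr` helper and the toy frames are illustrative, not code from the paper; the paper does not specify its exact PSNR implementation):

```python
import numpy as np

def psnr(reference: np.ndarray, rendered: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio (in dB) between two images of the same shape."""
    # Mean squared error in floating point to avoid uint8 overflow.
    mse = np.mean((reference.astype(np.float64) - rendered.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy example: two small 8-bit "frames" differing in a single pixel.
ref = np.zeros((4, 4), dtype=np.uint8)
ren = ref.copy()
ren[0, 0] = 16
print(round(psnr(ref, ren), 2))  # → 36.09
```

Higher values indicate the rendered frame is closer to the ground-truth frame, so a PSNR of 29.87 across the 14,000-frame benchmark reflects a close pixel-level match between the synthesized and real surgical images.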