🤖 AI Summary
To address the challenge of real-time, accurate tracking of interventional devices (catheters, balloons, stents) in low-contrast, artifact-corrupted X-ray fluoroscopy videos with scarce annotations, this paper proposes an auxiliary-cue-driven self-supervised feature-learning framework. Methodologically, it integrates multi-scale optical-flow-guided contrastive pretraining, decoupled feature distillation, and a dynamic attention-based tracking head, enabling implicit modeling of device morphology and motion priors without manual labels and substantially improving cross-view and cross-device generalization. Evaluated on clinical X-ray sequences, the framework achieves a mean Area-over-Recall (AOR) of 92.3%, outperforming the prior state of the art by 7.8 percentage points while sustaining real-time inference at 36 FPS. Its core contribution is the first unsupervised mechanism for modeling device motion priors, which overcomes robustness bottlenecks in scenarios with small targets, multiple co-occurring devices, and severe imaging artifacts.
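To make the optical-flow-guided contrastive pretraining idea concrete, below is a minimal, dependency-free sketch of an InfoNCE-style loss in which each anchor feature's positive is the candidate feature that optical flow maps it to across frames. All names (`flow_guided_infonce`, `flow_match`) are hypothetical illustrations, not the paper's actual implementation, which operates on multi-scale CNN feature maps rather than toy vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def flow_guided_infonce(anchors, candidates, flow_match, temperature=0.1):
    """InfoNCE loss where flow_match[i] gives the index of the candidate
    that optical flow designates as anchor i's positive; all other
    candidates act as negatives. (Hypothetical sketch, not the paper's code.)"""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[flow_match[i]] - log_denom)
    return total / len(anchors)
```

As expected for a contrastive objective, the loss is small when flow pairs each anchor with its matching feature and large when the correspondence is wrong; in the paper this supervision signal comes from flow fields rather than labels.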