AI Summary
To address the dual challenges of performance degradation and confidence miscalibration under distribution shift in foundation models (e.g., CLIP, SAM), this paper proposes StaRFM, a unified robust framework. Methodologically, StaRFM is the first to jointly integrate a Fisher Information Penalty (FIP) for regularization with a voxel-/patch-level Confidence Misalignment Penalty (CMP) as a calibration loss, supporting both 2D/3D vision and medical imaging tasks. Theoretically, it derives a PAC-Bayes generalization bound and optimizes the Brier score; practically, it enables plug-and-play deployment. Evaluated on 19 diverse vision benchmarks, StaRFM achieves an average accuracy gain of 3.5% and reduces Expected Calibration Error (ECE) by 28%. In medical image segmentation, it attains an 84.7% Dice score and 4.8 mm HD95, narrowing cross-domain performance gaps by 40%. These results demonstrate substantial improvements in model generalization and uncertainty calibration.
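The Expected Calibration Error (ECE) reported above can be measured with the standard binned estimator: predictions are grouped by confidence, and the gap between each bin's accuracy and mean confidence is averaged, weighted by bin mass. A minimal NumPy sketch (the bin count and function name are illustrative, not taken from the paper):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: mass-weighted mean |accuracy - confidence| over bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # assign each sample to the bin (lo, hi] by its confidence
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()    # empirical accuracy in the bin
            conf = confidences[mask].mean()  # mean confidence in the bin
            ece += (mask.sum() / n) * abs(acc - conf)
    return ece
```

A model that is 90% confident but always right contributes a 0.1 gap in its bin, so lowering ECE means bringing stated confidence in line with observed accuracy.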
Abstract
Foundation models like CLIP and SAM have transformed computer vision and medical imaging via low-shot transfer learning. However, deployment of these models is hindered by two key challenges: *distribution shift* between training and test data, and *confidence misalignment* that leads to overconfident incorrect predictions. These issues manifest differently in vision-language classification and medical segmentation tasks, yet existing solutions remain domain-specific. We propose *StaRFM*, a unified framework addressing both challenges. It introduces a Fisher information penalty (FIP), extended to 3D medical data via patch-wise regularization, to reduce covariate shift in CLIP and SAM embeddings. Additionally, a confidence misalignment penalty (CMP), reformulated for voxel-level predictions, calibrates uncertainty in segmentation tasks. We theoretically derive PAC-Bayes bounds showing that FIP controls generalization via the Fisher-Rao norm, while CMP minimizes calibration error through Brier score optimization. StaRFM delivers consistent gains: +3.5% accuracy and 28% lower ECE on 19 vision datasets (e.g., ImageNet, Office-Home), 84.7% DSC and 4.8 mm HD95 in medical segmentation (e.g., BraTS, ATLAS), and a 40% smaller cross-domain performance gap compared to prior baseline methods. The framework is plug-and-play, requiring minimal architectural changes for seamless integration with foundation models. Code and models will be released at https://anonymous.4open.science/r/StaRFM-C0CD/README.md
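Since the abstract ties CMP to Brier score optimization, the calibration term can be pictured as a Brier-style penalty added to the task loss. A minimal NumPy sketch under that assumption (the weight `lam` and the function names `brier_score`/`cmp_penalty` are illustrative, not the paper's exact formulation):

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared distance between predicted class probabilities
    and one-hot targets; probs is (N, C), labels is (N,) int ids."""
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

def cmp_penalty(probs, labels, lam=0.1):
    """Hypothetical CMP-style term: a weighted Brier penalty that grows
    when confident predictions disagree with the labels, so adding it to
    the task loss discourages overconfident mistakes."""
    return lam * brier_score(probs, labels)
```

For voxel-level segmentation, the same penalty would be applied per voxel (flattening the spatial dimensions into the batch axis), matching the abstract's voxel-wise reformulation of CMP.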