Frozen Foundation-Model Embeddings Discard Small-Lesion Signal in Chest Radiography: Implications for Pre-Deployment Evaluation

📅 2026-06-09
🤖 AI Summary
This study examines how frozen vision transformers (ViTs) lose small-scale, low-contrast signal in chest X-rays under standard global pooling (CLS token or patch-mean). The authors evaluate five frozen ViTs (RAD-DINO, DINOv2-B/14, DINOv3 ViT-7B, BiomedCLIP, MedSigLIP) and a frozen DINO-pretrained ResNet-50 control on three large cohorts (NIH-CXR14, MIMIC-CXR, Emory-CXR) and on ChestX-Det10, using a small-scale-perturbation panel and a bounding-box-stratified probe on real lesions. On the perturbation panel, CLS embeddings sit at chance (AUC 0.500-0.524), while patch-local probes recover AUC ~1.0 from the same forward pass (per-model mean improvement +0.412 to +0.488). On ChestX-Det10, image-level CLS classification shows within-class small-versus-large gaps of up to 0.243 AUC, whereas bounding-box-level patch-local pooling reaches AUC ≥ 0.899 for every model-class combination. The results indicate that small-scale signal is suppressed at the global-aggregation step rather than absent from the patch tokens, which challenges the practice of evaluating frozen models through global embeddings alone.
📝 Abstract
Frozen vision-transformer (ViT) foundation-model embeddings increasingly serve as the substrate for downstream chest-radiography (CXR) pipelines, yet where small-scale, low-contrast signal is retained or lost in the frozen forward pass has not been systematically quantified across architectures, pretraining domains, and objectives. We probed five frozen ViTs (RAD-DINO, DINOv2-B/14, DINOv3 ViT-7B, BiomedCLIP, MedSigLIP) and a frozen DINO-pretrained ResNet-50 architectural control across three large CXR cohorts (NIH-CXR14, MIMIC-CXR, Emory-CXR; aggregate pool n=492,724) and ChestX-Det10 (n=3,543; 1,462 small-lesion bounding boxes across Calcification, Nodule, Mass). Each model was evaluated with a small-scale-perturbation panel and a region-aware bounding-box-stratified probe on real lesions, comparing three pooling modes from the same forward pass: classification token (CLS), patch-mean (mean over all final-layer patch tokens), and bounding-box-restricted patch-local. On the perturbation panel, CLS embeddings sat at the chance floor (area under the ROC curve [AUC] 0.500-0.524); patch-mean was indistinguishable from CLS on iso-blur and reticular-fine cells but rose with CLS on larger directional-blur footprints, while disease AUC on globally decided tasks ranged 0.642-0.913. Patch-local probes recovered AUC ~1.0 from the same forward pass (per-model mean improvement +0.412 to +0.488); the ResNet-50 control reproduced the chance floor. On ChestX-Det10, image-level CLS classification showed within-class small-versus-large stratum gaps up to +0.243 AUC; bounding-box-level patch-local pooling on the same forward pass recovered AUC >= 0.899 on every (model x class) cell. Frozen ViT embeddings silently suppress small-scale signal at the global-aggregation step; the signal is recoverable from patch tokens conditional on a region of interest.
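The three pooling modes compared in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the assumption that the CLS token sits at index 0 followed by patch tokens in row-major order, and the square image and patch sizes are all illustrative choices made here.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_pool(tokens: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty sequence of equal-length token vectors."""
    if not tokens:
        raise ValueError("no tokens to pool")
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def bbox_patch_indices(
    bbox: Tuple[int, int, int, int], image_size: int, patch_size: int
) -> List[int]:
    """Row-major indices of patches overlapped by a pixel box (x0, y0, x1, y1).

    x1 and y1 are treated as exclusive; the image is assumed square.
    """
    x0, y0, x1, y1 = bbox
    grid = image_size // patch_size  # patches per side
    col0 = max(0, x0 // patch_size)
    row0 = max(0, y0 // patch_size)
    col1 = min(grid - 1, (x1 - 1) // patch_size)
    row1 = min(grid - 1, (y1 - 1) // patch_size)
    return [
        r * grid + c
        for r in range(row0, row1 + 1)
        for c in range(col0, col1 + 1)
    ]


def pool_three_ways(
    tokens: Sequence[Sequence[float]],
    bbox: Tuple[int, int, int, int],
    image_size: int,
    patch_size: int,
) -> Dict[str, Vector]:
    """Derive CLS, patch-mean, and patch-local embeddings from one token set.

    Assumption: tokens[0] is the CLS token and tokens[1:] are patch tokens.
    """
    cls_token, patches = tokens[0], tokens[1:]
    local = [patches[i] for i in bbox_patch_indices(bbox, image_size, patch_size)]
    return {
        "cls": list(cls_token),
        "patch_mean": mean_pool(patches),
        "patch_local": mean_pool(local),
    }
```

As a toy illustration of the dilution effect, assume a 224-pixel image with 14-pixel patches (a 16 x 16 grid of 256 patch tokens). A box spanning pixels (28, 42) to (56, 70) overlaps only 4 patches, so any signal confined to those patches contributes 4/256, about 1.6%, of the patch-mean embedding, while patch-local pooling averages over those 4 patches alone.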
Problem

Research questions and friction points this paper is trying to address.

small-lesion detection
chest radiography
frozen foundation models
signal suppression
embedding aggregation
Innovation

Methods, ideas, or system contributions that make the work stand out.

frozen foundation models
small-lesion detection
patch-local pooling
chest radiography
vision transformers
Raajitha Muthyala
Department of Biomedical Engineering and Informatics, Indiana University, 535 W Michigan St., IT 475J, Indianapolis, IN, 46202, USA
Zhenan Yin
Department of Biomedical Engineering and Informatics, Indiana University, 535 W Michigan St., IT 475J, Indianapolis, IN, 46202, USA
Alekhya Jilla
Department of Biomedical Engineering and Informatics, Indiana University, 535 W Michigan St., IT 475J, Indianapolis, IN, 46202, USA
Frank Li
Emory University
Medical Imaging, Healthcare Informatics, Deep Learning, and Computational Fluid Dynamics
Theo Dapamede
Emory University
Artificial Intelligence, Radiology, Imaging Informatics, Photon Counting CT
Bardia Khosravi
Radiology Resident @ Yale
Radiology, Artificial Intelligence, Imaging Informatics
Mohammadreza Chavoshi
MD, Postdoctoral Researcher, Emory University
Radiology, meta-analysis, Artificial Intelligence
Judy Gichoya
Department of Radiology and Imaging Sciences, Emory University, 1364 Clifton Rd NE, Atlanta, GA, 30322, USA
Saptarshi Purkayastha
Indiana University Indianapolis
global health, EHR, imaging informatics, mHealth, information infrastructure