🤖 AI Summary
This study examines how frozen vision transformers (ViTs) suppress signal from small, low-contrast findings in chest X-rays when their outputs are aggregated with standard global pooling (CLS token or mean over patch tokens). The authors evaluate five frozen ViTs (RAD-DINO, DINOv2-B/14, DINOv3 ViT-7B, BiomedCLIP, MedSigLIP) and a frozen DINO-pretrained ResNet-50 control on NIH-CXR14, MIMIC-CXR, and Emory-CXR, plus ChestX-Det10, whose lesion bounding boxes are used to extract region-restricted patch embeddings; performance is measured by AUC. On a small-scale perturbation panel, CLS embeddings sit at chance (AUC 0.500–0.524), whereas bounding-box-restricted patch pooling from the same forward pass recovers an AUC of about 1.0, a per-model mean improvement of 0.412–0.488. On ChestX-Det10, patch-local pooling reaches AUC ≥ 0.899 for every model–class combination. The results indicate that the signal is lost at the global-aggregation step rather than in the patch tokens, which calls into question the default practice of using globally pooled frozen embeddings.
📝 Abstract
Frozen vision-transformer (ViT) foundation-model embeddings increasingly serve as the substrate for downstream chest-radiography (CXR) pipelines, yet where small-scale, low-contrast signal is retained or lost in the frozen forward pass has not been systematically quantified across architectures, pretraining domains, and objectives. We probed five frozen ViTs (RAD-DINO, DINOv2-B/14, DINOv3 ViT-7B, BiomedCLIP, MedSigLIP) and a frozen DINO-pretrained ResNet-50 architectural control across three large CXR cohorts (NIH-CXR14, MIMIC-CXR, Emory-CXR; aggregate pool n=492,724) and ChestX-Det10 (n=3,543; 1,462 small-lesion bounding boxes across Calcification, Nodule, Mass). Each model was evaluated with a small-scale-perturbation panel and a region-aware bounding-box-stratified probe on real lesions, comparing three pooling modes from the same forward pass: classification token (CLS), patch-mean (mean over all final-layer patch tokens), and bounding-box-restricted patch-local. On the perturbation panel, CLS embeddings sat at the chance floor (area under the ROC curve [AUC] 0.500-0.524); patch-mean was indistinguishable from CLS on iso-blur and reticular-fine cells but rose with CLS on larger directional-blur footprints, while disease AUC on globally decided tasks ranged 0.642-0.913. Patch-local probes recovered AUC ~1.0 from the same forward pass (per-model mean improvement +0.412 to +0.488); the ResNet-50 control reproduced the chance floor. On ChestX-Det10, image-level CLS classification showed within-class small-versus-large stratum gaps up to +0.243 AUC; bounding-box-level patch-local pooling on the same forward pass recovered AUC ≥ 0.899 on every (model × class) cell. Frozen ViT embeddings silently suppress small-scale signal at the global-aggregation step; the signal is recoverable from patch tokens conditional on a region of interest.
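The three pooling modes compared in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the final-layer token array places the CLS token at index 0, has no register tokens, and lays out patch tokens in row-major order over a square-patch grid. The function name `pool_embeddings` and its parameters are hypothetical.

```python
import numpy as np

def pool_embeddings(tokens, grid_hw, patch_px, bbox_xyxy):
    """Return (CLS, patch-mean, patch-local) embeddings from one forward pass.

    tokens:    (1 + H*W, D) final-layer tokens; CLS assumed at index 0.
    grid_hw:   (H, W) patch grid size.
    patch_px:  patch side length in input-image pixels (assumed square).
    bbox_xyxy: (x0, y0, x1, y1) in input-image pixels, x1/y1 exclusive.
    """
    h, w = grid_hw
    dim = tokens.shape[-1]

    # Mode 1: classification token.
    cls_emb = tokens[0]

    # Mode 2: mean over all patch tokens.
    patches = tokens[1:].reshape(h, w, dim)  # row-major layout assumed
    mean_emb = patches.reshape(h * w, dim).mean(axis=0)

    # Mode 3: mean over patch tokens whose cells overlap the bounding box.
    x0, y0, x1, y1 = bbox_xyxy
    c0 = min(max(int(x0 // patch_px), 0), w - 1)
    r0 = min(max(int(y0 // patch_px), 0), h - 1)
    c1 = min(int(np.ceil(x1 / patch_px)), w)
    r1 = min(int(np.ceil(y1 / patch_px)), h)
    c1, r1 = max(c1, c0 + 1), max(r1, r0 + 1)  # keep at least one patch
    local_emb = patches[r0:r1, c0:c1].reshape(-1, dim).mean(axis=0)

    return cls_emb, mean_emb, local_emb


if __name__ == "__main__":
    # Synthetic tokens: a 16x16 grid where only one patch carries "signal".
    toks = np.zeros((1 + 16 * 16, 8))
    toks[1 + 5 * 16 + 7] = 1.0  # patch at row 5, column 7
    cls_e, mean_e, local_e = pool_embeddings(toks, (16, 16), 14, (98, 70, 112, 84))
    print(mean_e[0], local_e[0])
```

In this toy case the single informative patch is diluted by the other 255 patches under mean pooling, while the box-restricted mean keeps it at full strength. The abstract attributes the loss of small-lesion signal to this global-aggregation step.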