GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of aligning local pathological regions in medical images with the report text that describes them when supervision is available only at the global image-report level, so fine-grained correspondences are sparse and never explicitly labeled. The authors propose GLINT, a framework that uses a sigmoid gating mechanism to activate only the image patches relevant to a given text query, and adds dense feature regularization against a frozen self-supervised teacher to preserve the fine-grained patch features the gate relies on. Built on DINOv3 (2D chest X-rays) and V-JEPA 2.1 (3D chest CT), the model supports zero-shot classification, grounding, and segmentation from free-text queries and, to the authors' knowledge, is the first to demonstrate zero-shot segmentation on 3D CT volumes without mask supervision. The authors report the most pronounced gains on zero-shot grounding and segmentation, and in downstream evaluation GLINT outperforms both self-supervised encoders and medical vision-language models on classification, report generation, and segmentation.

📝 Abstract

Vision-language models (VLMs) for radiology have emerged as a scalable paradigm by leveraging image-report pairs naturally produced in clinical workflows. However, this pairing reveals a mismatch in scale: each finding occupies only a small region of the image, yet supervision is provided only at the global image-report level. This poses a central challenge: prior approaches spread weight densely across all patches rather than concentrating on the sparse subset relevant to a given query. To address this, we present GLINT (Gated Language-Image alignmeNT), a framework that explicitly models this sparse correspondence. On the alignment side, we introduce Sparsely Gated Alignment, a novel architecture in which a sigmoid gate over a separate gate embedding space activates only the patches relevant to each textual query, enforcing explicit sparsity. On the representation side, we add Dense Feature Regularization, which anchors the trainable encoder's intermediate features to a frozen self-supervised learning (SSL) teacher, preserving the fine-grained patch features that the gate relies on. The same recipe applies to both 2D chest X-ray (CXR) and 3D chest computed tomography (CT), built with DINOv3 and V-JEPA 2.1, respectively. GLINT enables zero-shot classification, grounding, and segmentation from free-text queries, and to our knowledge is the first to demonstrate zero-shot segmentation on 3D CT volumes without mask supervision. Notably, the most pronounced gains arise on zero-shot grounding and segmentation, where sparse, query-specific localization is required, consistent with our design intent. In downstream evaluation, GLINT outperforms both SSL encoders and medical VLMs on classification, report generation, and segmentation.
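To make the two components in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the gate is a sigmoid over a temperature-scaled cosine similarity between each patch's gate embedding and the query's gate embedding, that gated patches are combined by a normalized weighted sum, and that the regularizer is a mean cosine distance between student and frozen-teacher patch features. The function names, `temperature`, and `bias` are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    """Logistic function; maps a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def sparsely_gated_pool(patch_feats, patch_gate_embs, query_gate_emb,
                        temperature=0.1, bias=0.0):
    """Pool patch features using a per-patch sigmoid gate (assumed form).

    Each patch is gated independently, so many patches can sit near zero
    for a given query, unlike a softmax, which always sums to one.
    """
    gates = [
        sigmoid(cosine(g, query_gate_emb) / temperature + bias)
        for g in patch_gate_embs
    ]
    total = sum(gates) + 1e-8
    dim = len(patch_feats[0])
    pooled = [
        sum(w * f[d] for w, f in zip(gates, patch_feats)) / total
        for d in range(dim)
    ]
    return gates, pooled


def dense_feature_regularization(student_feats, teacher_feats):
    """Mean cosine distance between student and frozen-teacher patch features
    (an assumed choice of distance; the source does not specify one)."""
    losses = [1.0 - cosine(s, t) for s, t in zip(student_feats, teacher_feats)]
    return sum(losses) / len(losses)


# Toy example: three patches, one text query.
patch_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
patch_gate_embs = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
query_gate_emb = [1.0, 0.0]

gates, pooled = sparsely_gated_pool(patch_feats, patch_gate_embs, query_gate_emb)
reg = dense_feature_regularization(patch_feats, patch_feats)
```

In this toy setup the first patch, whose gate embedding matches the query, receives a gate near one, and the opposed patch receives a gate near zero. Under the same assumptions, the per-patch gate values could be reshaped to the patch grid and thresholded to obtain a coarse query-specific mask.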

Problem

Research questions and friction points this paper is trying to address.

vision-language alignment

sparsity

radiology

fine-grained representation

image-report mismatch

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparsely Gated Alignment

Vision-Language Models

Zero-shot Segmentation