Enhancing Visual Inspection Capability of Multi-Modal Large Language Models on Medical Time Series with Supportive Conformalized and Interpretable Small Specialized Models

📅 2025-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) generalize well but show low task-specific accuracy and poor interpretability in medical time-series analysis. To address this, we propose ConMIL, a plug-and-play decision-support small specialized model that combines multiple instance learning (MIL) with conformal prediction (CP). MIL localizes clinically relevant signal segments, while CP produces calibrated, set-valued predictions, improving both accuracy and interpretability. Paired with a multimodal LLM (Qwen2-VL-7B), ConMIL raises precision on confident samples to 94.92% for arrhythmia detection and 96.82% for sleep staging, compared with standalone LLM accuracy of 46.13% and 13.16%, a gain of more than 48 percentage points on each task. Our work aims to bridge task-specific precision and broader contextual reasoning for more reliable and interpretable clinical decision support.
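To make the segment-localization idea concrete, here is a minimal sketch of attention-based MIL pooling, one common MIL formulation. It is not claimed to be ConMIL's actual architecture: the function name `attention_mil_pool`, the scoring vector `w`, and the toy embeddings are all hypothetical. A recording is treated as a bag of segment embeddings, and the per-segment attention weights indicate which segments drive the bag-level representation.

```python
import numpy as np

def attention_mil_pool(segment_embeddings, w):
    """Pool a bag of segment embeddings into one vector.

    segment_embeddings: (n_segments, d) array, one row per signal segment.
    w: (d,) scoring vector (hypothetical; learned in a real model).
    Returns the pooled bag embedding and the per-segment attention weights.
    """
    logits = segment_embeddings @ w
    logits = logits - logits.max()          # stabilize the softmax
    weights = np.exp(logits) / np.exp(logits).sum()
    bag_embedding = weights @ segment_embeddings
    return bag_embedding, weights

# Toy bag with three 2-dimensional segment embeddings (made-up values).
segments = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [3.0, 0.0]])
w = np.array([1.0, 0.0])

bag, weights = attention_mil_pool(segments, w)
top_segment = int(np.argmax(weights))       # most influential segment
```

In this sketch the weights sum to one, so they can be read as a relative importance ranking over segments, which is the kind of signal an interpretable model could surface to a clinician or pass to an LLM.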

📝 Abstract
Large language models (LLMs) exhibit remarkable capabilities in visual inspection of medical time-series data, achieving proficiency comparable to human clinicians. However, their broad scope limits domain-specific precision, and proprietary weights hinder fine-tuning for specialized datasets. In contrast, small specialized models (SSMs) excel in targeted tasks but lack the contextual reasoning required for complex clinical decision-making. To address these challenges, we propose ConMIL (Conformalized Multiple Instance Learning), a decision-support SSM that integrates seamlessly with LLMs. By using Multiple Instance Learning (MIL) to identify clinically significant signal segments and conformal prediction for calibrated set-valued outputs, ConMIL enhances LLMs' interpretative capabilities for medical time-series analysis. Experimental results demonstrate that ConMIL significantly improves the performance of state-of-the-art LLMs, such as ChatGPT4.0 and Qwen2-VL-7B. Specifically, ConMIL-supported Qwen2-VL-7B achieves 94.92% and 96.82% precision for confident samples in arrhythmia detection and sleep staging, compared to standalone LLM accuracy of 46.13% and 13.16%. These findings highlight the potential of ConMIL to bridge task-specific precision and broader contextual reasoning, enabling more reliable and interpretable AI-driven clinical decision support.
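The "calibrated set-valued outputs" mentioned above can be illustrated with a minimal sketch of split conformal prediction for classification. This is a generic textbook variant using the score 1 minus the probability of the true class, not necessarily the procedure used in ConMIL. The function names, the toy probabilities, and the reading of a "confident" sample as one whose prediction set contains exactly one class are assumptions made for illustration.

```python
import math
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate a score threshold on held-out data.

    Nonconformity score: 1 - probability assigned to the true class.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return 1.0  # too few calibration points: every class is included
    return float(np.sort(scores)[k - 1])

def prediction_sets(test_probs, qhat):
    """Boolean mask: class j is in the set when 1 - p_j <= qhat."""
    return (1.0 - test_probs) <= qhat

# Toy calibration data (made-up softmax outputs for a 2-class task).
cal_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4]])
cal_labels = np.array([0, 1, 0, 1])
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.2)

test_probs = np.array([[0.95, 0.05], [0.55, 0.45]])
sets = prediction_sets(test_probs, qhat)
confident = sets.sum(axis=1) == 1   # assumption: singleton set = confident
```

Under exchangeability of calibration and test data, sets built this way contain the true label with probability at least 1 - alpha. In the toy data the first test sample yields a single-class set, while the second keeps both classes and would be flagged as uncertain rather than forced into one label.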
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Medical Time Series Data
Specialized Task Precision
Innovation

Methods, ideas, or system contributions that make the work stand out.

ConMIL
Medical Time Series Analysis
Accuracy Enhancement