AI Summary
This work addresses a limitation of current clinical large language models: they often fail to adapt treatment decisions when patient context changes, a capability that conventional medical question-answering benchmarks do not adequately assess. To bridge this gap, the authors propose ClinPivot, an auditable evaluation benchmark designed to measure adaptability in therapeutic decision-making. ClinPivot is built from biomedical relations and pairs each patient context with a pivoted version in which a new clinical constraint shifts the set of appropriate treatments. Alongside the benchmark, the authors study decision-structured supervision under matched knowledge budgets, combined with lightweight replay. Experiments show that frontier models and task-adapted Qwen variants often fail to change decisions correctly on ClinPivot, and that model rankings shift across evaluation regimes. Decision-structured supervision improves both pivot-sensitive decision-making and medical QA, while lightweight replay reduces losses in general assistant ability. Together, these results expose a disconnect between standard medical QA accuracy and context-sensitive treatment decision-making.
Abstract
Clinical foundation models are evaluated with factual or exam-style medical QA, but treatment decisions must change when patient context changes. We introduce ClinPivot, an auditable treatment-decision benchmark built from biomedical relations and pivoted patient contexts. ClinPivot asks whether models change treatment choices when new clinical constraints shift the action space. We find that strong medical QA performance does not reliably predict decision-making performance: frontier models and task-adapted Qwen variants often fail to change decisions correctly, and model rankings shift across evaluation regimes. Decision-structured supervision improves pivot-sensitive decision-making and medical QA under matched knowledge budgets, while lightweight replay reduces losses in general assistant ability.
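The pivot-based evaluation idea described above can be illustrated with a minimal sketch. This is not the authors' scoring code: the `PivotPair` schema, the metric names (`base_accuracy`, `pivot_accuracy`, `pair_accuracy`, `stuck_rate`), and the placeholder labels such as `drug_A` are all assumptions made for illustration. The sketch assumes each item pairs a base context with a pivoted context and a reference decision for each, and it checks whether a model's choice changes correctly when the context pivots.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class PivotPair:
    """One base/pivoted context pair with reference decisions (hypothetical schema)."""
    base_context: str
    pivoted_context: str
    base_answer: str
    pivoted_answer: str


def score_pivots(pairs: Sequence[PivotPair],
                 predict: Callable[[str], str]) -> Dict[str, float]:
    """Score a predictor on pivot pairs; all metric names are illustrative."""
    if not pairs:
        raise ValueError("need at least one pivot pair")
    base_ok = pivot_ok = both_ok = stuck = 0
    for pair in pairs:
        base_pred = predict(pair.base_context)
        pivot_pred = predict(pair.pivoted_context)
        base_hit = base_pred == pair.base_answer
        pivot_hit = pivot_pred == pair.pivoted_answer
        base_ok += base_hit
        pivot_ok += pivot_hit
        both_ok += (base_hit and pivot_hit)
        # "Stuck": the model repeats the base decision even though the
        # pivot changed the reference decision.
        stuck += (pivot_pred == pair.base_answer
                  and pair.base_answer != pair.pivoted_answer)
    n = len(pairs)
    return {
        "base_accuracy": base_ok / n,
        "pivot_accuracy": pivot_ok / n,
        "pair_accuracy": both_ok / n,
        "stuck_rate": stuck / n,
    }


# Toy data with placeholder labels; not drawn from the benchmark.
PAIRS = [
    PivotPair("case-1", "case-1 + new constraint", "drug_A", "drug_B"),
    PivotPair("case-2", "case-2 + new constraint", "drug_C", "drug_D"),
]

# A toy predictor that ignores the new constraint in case-1.
TOY_PREDICTIONS = {
    "case-1": "drug_A",
    "case-1 + new constraint": "drug_A",
    "case-2": "drug_C",
    "case-2 + new constraint": "drug_D",
}

scores = score_pivots(PAIRS, TOY_PREDICTIONS.__getitem__)
```

In this toy run the predictor answers every base context correctly but keeps its original choice after one pivot, so base accuracy is high while pair accuracy is not. That gap is the kind of mismatch between QA-style correctness and decision updating that the abstract describes.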