🤖 AI Summary
This work addresses a limitation of how biomedical question-answering models are evaluated: answer accuracy is prioritized, while parseable output, adherence to structured reliability instructions, and the ability to recognize weak or uncertain answer spaces are neglected. The authors propose HypothesisMed, an inference-time pipeline that combines direct answering, chain-of-thought reasoning, HypothesisMed-v3 prompting, and answer fusion. Fusion selects the final answer, while HypothesisMed-v3 supplies SPACE labels (VALID, INCOMPLETE, or CONTRADICTED) and confidence information so that the output can be audited. On MedQA, MedMCQA, and PubMedQA, the pipeline improves weighted accuracy over each model's best direct or chain-of-thought baseline. In the scaled evaluation of 10,183 examples per model, fusion raises Phi-4-mini accuracy from 0.4296 to 0.5192, and Qwen2.5-7B fusion reaches complete parse and SPACE coverage with much lower false commitment, although its chain-of-thought accuracy remains slightly higher.
📝 Abstract
Biomedical question answering with large language models is commonly evaluated using answer accuracy, but answer accuracy alone does not indicate whether a model can produce parseable outputs, follow structured reliability instructions, recognize weak answer spaces, or avoid confident incorrect commitments. This paper presents HypothesisMed, an inference-time reliability pipeline for biomedical multiple-choice question answering. It combines direct prompting, chain-of-thought prompting, HypothesisMed-v3 prompting, and answer fusion. The final answer is selected by fusion, while HypothesisMed-v3 supplies SPACE labels and confidence information. SPACE labels mark the answer space as VALID, INCOMPLETE, or CONTRADICTED. We evaluate Qwen2.5-7B, Phi-4-mini, DeepSeek-R1-32B, and BioMistral-7B on MedQA, MedMCQA, and PubMedQA using 1,000 examples per dataset. The pipeline improves weighted accuracy over each model's best direct or chain-of-thought baseline while increasing parse and SPACE coverage. We also scale the evaluation of Qwen2.5-7B and Phi-4-mini to 10,183 examples per model. Fusion improves Phi-4-mini accuracy from 0.4296 to 0.5192, while Qwen2.5-7B chain-of-thought remains slightly higher in answer accuracy. However, Qwen2.5-7B fusion achieves complete parse and SPACE coverage with much lower false commitment. A 12,000-example SPACE stress test shows that answer-space diagnosis remains difficult, with SPACE accuracy of 0.3074 for Qwen2.5-7B and 0.4168 for Phi-4-mini. These results show that answer accuracy, parseability, structured reliability reporting, calibration behavior, and false-commitment behavior are separable capabilities. The main contribution is not a universal state-of-the-art claim, but a reproducible inference-time framework for evaluating biomedical question answering models as auditable workflow components under structured reliability constraints.
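The abstract does not specify the fusion rule or the exact metric definitions, so the following Python sketch is only one plausible reading of the pipeline. It assumes majority-vote fusion over the three prompting strategies (ties broken by strategy order), and it assumes that "false commitment" means a wrong fused answer reported with confidence at or above a threshold. The names `Prediction`, `fuse`, `evaluate`, and `commit_threshold` are hypothetical and do not come from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

SPACE_LABELS = {"VALID", "INCOMPLETE", "CONTRADICTED"}


@dataclass
class Prediction:
    """Parsed outputs for one question; None means the output failed to parse."""
    direct: Optional[str]        # answer from direct prompting
    cot: Optional[str]           # answer from chain-of-thought prompting
    hypothesis: Optional[str]    # answer from HypothesisMed-v3 prompting
    space: Optional[str]         # SPACE label from HypothesisMed-v3
    confidence: Optional[float]  # confidence from HypothesisMed-v3


def fuse(pred: Prediction) -> Optional[str]:
    """ASSUMPTION: majority vote; ties go to the earliest strategy in order."""
    votes = [a for a in (pred.direct, pred.cot, pred.hypothesis) if a is not None]
    if not votes:
        return None
    counts = Counter(votes)
    top = max(counts.values())
    return next(a for a in votes if counts[a] == top)


def evaluate(preds, gold, commit_threshold=0.8):
    """Report accuracy, coverage, and false commitment as separate quantities.

    ASSUMPTION: a false commitment is a wrong fused answer whose reported
    confidence is at least commit_threshold.
    """
    n = len(preds)
    fused = [fuse(p) for p in preds]
    correct = sum(f is not None and f == g for f, g in zip(fused, gold))
    parsed = sum(f is not None for f in fused)
    labelled = sum(p.space in SPACE_LABELS for p in preds)
    false_commits = sum(
        f is not None
        and f != g
        and p.confidence is not None
        and p.confidence >= commit_threshold
        for p, f, g in zip(preds, fused, gold)
    )
    return {
        "accuracy": correct / n,
        "parse_coverage": parsed / n,
        "space_coverage": labelled / n,
        "false_commitment": false_commits / n,
    }


# Toy data, invented purely for illustration.
preds = [
    Prediction("A", "A", "B", "VALID", 0.9),
    Prediction("C", "D", "D", "INCOMPLETE", 0.95),
    Prediction(None, None, None, None, None),
    Prediction("B", "C", None, "CONTRADICTED", 0.3),
]
gold = ["A", "C", "B", "B"]
metrics = evaluate(preds, gold)
print(metrics)
```

Keeping the four quantities separate mirrors the abstract's point that accuracy, parseability, structured reliability reporting, and false-commitment behavior can move independently of one another.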