HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation in how biomedical question-answering models are usually evaluated: answer accuracy alone does not show whether a model produces parseable output, follows structured reliability instructions, recognizes weak answer spaces, or avoids confident incorrect commitments. The authors propose HypothesisMed, an inference-time pipeline for biomedical multiple-choice question answering that combines direct answering, chain-of-thought reasoning, HypothesisMed-v3 prompting, and answer fusion. Fusion selects the final answer, while HypothesisMed-v3 supplies SPACE labels (VALID, INCOMPLETE, or CONTRADICTED) and confidence information, which makes the output auditable. On 1,000 examples each from MedQA, MedMCQA, and PubMedQA, the pipeline improves weighted accuracy over each model's best direct or chain-of-thought baseline. In a scaled evaluation with 10,183 examples per model, fusion raises Phi-4-mini accuracy from 0.4296 to 0.5192; Qwen2.5-7B fusion reaches complete parse and SPACE coverage with much lower false commitment, although its chain-of-thought baseline remains slightly more accurate.
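The fusion step can be illustrated with a minimal sketch. The summary does not state the actual fusion rule, so the majority vote with a fixed tie-break order used here is an assumption, as are the names `StrategyOutput`, `FusedResult`, and `fuse`. The only parts taken from the source are the three prompting strategies, the three SPACE labels, and the fact that the structured prompt supplies the SPACE label and confidence.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# The three SPACE labels named in the source.
SPACE_LABELS = {"VALID", "INCOMPLETE", "CONTRADICTED"}


@dataclass
class StrategyOutput:
    """Parsed output of one prompting strategy (hypothetical container)."""
    answer: Optional[str]               # option letter, or None if unparseable
    space: Optional[str] = None         # SPACE label, structured prompt only
    confidence: Optional[float] = None  # reported confidence, if any


@dataclass
class FusedResult:
    answer: Optional[str]
    space: Optional[str]
    confidence: Optional[float]


def fuse(direct: StrategyOutput,
         cot: StrategyOutput,
         structured: StrategyOutput) -> FusedResult:
    """Assumed fusion rule: majority vote over parseable answers.

    Ties are broken by a fixed priority order (structured, then
    chain-of-thought, then direct). This rule is an illustration,
    not the rule used in the paper.
    """
    priority = [structured, cot, direct]
    votes = Counter(o.answer for o in priority if o.answer is not None)

    answer = None
    if votes:
        top = max(votes.values())
        # First answer in priority order that has the highest vote count.
        answer = next(o.answer for o in priority
                      if o.answer is not None and votes[o.answer] == top)

    # Keep the SPACE label only if it is one of the three known values.
    space = structured.space if structured.space in SPACE_LABELS else None
    return FusedResult(answer, space, structured.confidence)


result = fuse(
    direct=StrategyOutput("B"),
    cot=StrategyOutput("B"),
    structured=StrategyOutput("C", space="VALID", confidence=0.7),
)
print(result)
```

Under this assumed rule, two agreeing strategies outvote the third, while the SPACE label and confidence always come from the structured prompt, so a fused answer can still carry a label even when the structured prompt's own answer is overruled.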

📝 Abstract

Biomedical question answering with large language models is commonly evaluated using answer accuracy, but answer accuracy alone does not indicate whether a model can produce parseable outputs, follow structured reliability instructions, recognize weak answer spaces, or avoid confident incorrect commitments. This paper presents HypothesisMed, an inference-time reliability pipeline for biomedical multiple-choice question answering. It combines direct prompting, chain-of-thought prompting, HypothesisMed-v3 prompting, and answer fusion. The final answer is selected by fusion, while HypothesisMed-v3 supplies SPACE labels and confidence information. SPACE labels mark the answer space as VALID, INCOMPLETE, or CONTRADICTED. We evaluate Qwen2.5-7B, Phi-4-mini, DeepSeek-R1-32B, and BioMistral-7B on MedQA, MedMCQA, and PubMedQA using 1,000 examples per dataset. The pipeline improves weighted accuracy over each model's best direct or chain-of-thought baseline while increasing parse and SPACE coverage. We also scale the evaluation of Qwen2.5-7B and Phi-4-mini to 10,183 examples per model. Fusion improves Phi-4-mini accuracy from 0.4296 to 0.5192, while Qwen2.5-7B chain-of-thought remains slightly higher in answer accuracy. However, Qwen2.5-7B fusion achieves complete parse and SPACE coverage with much lower false commitment. A 12,000-example SPACE stress test shows that answer-space diagnosis remains difficult, with SPACE accuracy of 0.3074 for Qwen2.5-7B and 0.4168 for Phi-4-mini. These results show that answer accuracy, parseability, structured reliability reporting, calibration behavior, and false-commitment behavior are separable capabilities. The main contribution is not a universal state-of-the-art claim, but a reproducible inference-time framework for evaluating biomedical question answering models as auditable workflow components under structured reliability constraints.
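The claim that these capabilities are separable is easier to see when each one is computed as its own number. The sketch below is illustrative only: the abstract does not define its metrics, so the definitions here are assumptions. In particular, false commitment is taken to mean a wrong answer reported with confidence at or above a hypothetical `commit_threshold`, and weighted accuracy is taken to mean accuracy weighted by each dataset's example count. The names `Prediction`, `summarize`, and `weighted_accuracy` are also hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass
class Prediction:
    """One evaluated example (hypothetical container)."""
    gold: str                    # correct option
    answer: Optional[str]        # parsed answer, or None if unparseable
    space: Optional[str]         # SPACE label, or None if missing
    confidence: Optional[float]  # reported confidence, if any


def summarize(preds: Sequence[Prediction],
              commit_threshold: float = 0.8) -> Dict[str, float]:
    """Report accuracy, coverage, and false commitment separately.

    Assumed definitions, not taken from the paper:
      - parse coverage: share of examples with a parseable answer
      - SPACE coverage: share of examples with a SPACE label
      - false commitment: share of examples answered wrongly with
        confidence >= commit_threshold
    """
    n = len(preds)
    if n == 0:
        raise ValueError("no predictions to summarize")

    correct = sum(p.answer == p.gold for p in preds)
    parsed = sum(p.answer is not None for p in preds)
    labeled = sum(p.space is not None for p in preds)
    false_commit = sum(
        p.answer is not None
        and p.answer != p.gold
        and p.confidence is not None
        and p.confidence >= commit_threshold
        for p in preds
    )
    return {
        "accuracy": correct / n,
        "parse_coverage": parsed / n,
        "space_coverage": labeled / n,
        "false_commitment": false_commit / n,
    }


def weighted_accuracy(per_dataset: Sequence[Tuple[float, int]]) -> float:
    """Example-count-weighted mean of (accuracy, n_examples) pairs (assumed)."""
    total = sum(n for _, n in per_dataset)
    if total == 0:
        raise ValueError("no examples")
    return sum(acc * n for acc, n in per_dataset) / total


preds = [
    Prediction("A", "A", "VALID", 0.9),       # correct
    Prediction("B", "C", "VALID", 0.95),      # wrong and confident
    Prediction("C", "D", "INCOMPLETE", 0.3),  # wrong but hedged
    Prediction("D", None, None, None),        # unparseable
]
print(summarize(preds))
```

With these definitions, two systems can have the same accuracy while differing in coverage or false commitment, which is the kind of separation the abstract reports between Qwen2.5-7B chain-of-thought and Qwen2.5-7B fusion.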

Problem

Research questions and friction points this paper is trying to address.

biomedical question answering

answer accuracy

structured reliability

parseability

false commitment

Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time fusion

structured reliability reporting

SPACE labeling