Haibu Mathematical-Medical Intelligent Agent: Enhancing Large Language Model Reliability in Medical Tasks via Verifiable Reasoning Chains

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
In high-stakes medical scenarios, large language models (LLMs) suffer from factual inaccuracies and logical unreliability. To address this, we propose a verifiable reasoning framework that employs recursive task decomposition, evidence-driven atomic reasoning, and automated logical auditing to construct traceable and verifiable reasoning chains. We introduce a novel “theorem-style knowledge bootstrapping” mechanism: formally verified reasoning chains are distilled into reusable knowledge units, which—integrated with retrieval-augmented generation (RAG)—enable a paradigm shift from first-principles reasoning to lightweight verification. Evaluated on an expert-annotated benchmark, our method achieves a 98.2% error detection rate with a false positive rate below 1%. Once the knowledge base matures, inference cost is projected to decrease by 85%, substantially outperforming existing baselines.

📝 Abstract
Large Language Models (LLMs) show promise in medicine but are prone to factual and logical errors, which is unacceptable in this high-stakes field. To address this, we introduce the "Haibu Mathematical-Medical Intelligent Agent" (MMIA), an LLM-driven architecture that ensures reliability through a formally verifiable reasoning process. MMIA recursively breaks down complex medical tasks into atomic, evidence-based steps. This entire reasoning chain is then automatically audited for logical coherence and evidence traceability, similar to theorem proving. A key innovation is MMIA's "bootstrapping" mode, which stores validated reasoning chains as "theorems." Subsequent tasks can then be efficiently solved using Retrieval-Augmented Generation (RAG), shifting from costly first-principles reasoning to a low-cost verification model. We validated MMIA across four healthcare administration domains, including DRG/DIP audits and medical insurance adjudication, using expert-validated benchmarks. Results showed MMIA achieved an error detection rate exceeding 98% with a false positive rate below 1%, significantly outperforming baseline LLMs. Furthermore, the RAG matching mode is projected to reduce average processing costs by approximately 85% as the knowledge base matures. In conclusion, MMIA's verifiable reasoning framework is a significant step toward creating trustworthy, transparent, and cost-effective AI systems, making LLM technology viable for critical applications in medicine.
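The audit step described in the abstract — checking each atomic step for evidence traceability and logical coherence — can be sketched as follows. This is an illustrative reconstruction, not MMIA's actual implementation: the `Step` fields, evidence identifiers, and audit rules are assumptions about what such a checker might look like.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    claim: str                  # one atomic assertion
    evidence: list              # IDs of cited guideline/record snippets
    depends_on: list = field(default_factory=list)  # indices of earlier steps

def audit_chain(steps, evidence_base):
    """Audit a reasoning chain: every step must be grounded (cite evidence
    or build on earlier steps), cite only known evidence, and depend only
    on steps that precede it. Returns (ok, list of (index, issue))."""
    issues = []
    for i, step in enumerate(steps):
        if not step.evidence and not step.depends_on:
            issues.append((i, "ungrounded step"))
        if any(e not in evidence_base for e in step.evidence):
            issues.append((i, "evidence not traceable"))
        if any(d >= i for d in step.depends_on):
            issues.append((i, "forward or self dependency"))
    return (len(issues) == 0, issues)

# Hypothetical DRG-audit example: the evidence IDs are made up.
evidence_base = {"record:patient-dose", "guideline:acetaminophen-max"}
chain = [
    Step("recorded dose is 5 g/day", ["record:patient-dose"]),
    Step("adult maximum is 4 g/day", ["guideline:acetaminophen-max"]),
    Step("dose exceeds maximum; flag the claim", [], depends_on=[0, 1]),
]
ok, issues = audit_chain(chain, evidence_base)  # ok is True, issues is []
```

A real auditor would also need to check semantic entailment between steps (the paper describes this as theorem-proving-like), which this structural check deliberately omits.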
Problem

Research questions and friction points this paper is trying to address.

Addressing factual and logical errors in medical LLM applications
Ensuring reliability through verifiable reasoning chains in medicine
Reducing processing costs while maintaining high accuracy standards
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM agent uses verifiable reasoning chains
Bootstrapping stores validated chains as theorems
RAG enables low-cost verification for efficiency
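The bootstrapping idea above — storing validated reasoning chains as reusable "theorems" and retrieving them for new tasks instead of re-deriving from first principles — might look like the following minimal sketch. The token-overlap (Jaccard) matcher and the threshold are stand-in assumptions; the paper's actual RAG retrieval is presumably embedding-based.

```python
def tokenize(text):
    return set(text.lower().split())

class TheoremStore:
    """Stores validated reasoning chains keyed by their task description."""
    def __init__(self):
        self.theorems = []  # list of (task_tokens, validated_chain)

    def add(self, task, chain):
        self.theorems.append((tokenize(task), chain))

    def match(self, task, threshold=0.6):
        """Return the best-matching stored chain, or None if no stored
        theorem is similar enough (falls back to first-principles reasoning)."""
        query = tokenize(task)
        best, best_score = None, 0.0
        for tokens, chain in self.theorems:
            score = len(query & tokens) / len(query | tokens)  # Jaccard
            if score > best_score:
                best, best_score = chain, score
        return best if best_score >= threshold else None

store = TheoremStore()
store.add("audit DRG claim for acetaminophen overdose", ["step1", "step2"])

# Close paraphrase hits the stored theorem; verification replaces re-derivation.
hit = store.match("audit DRG claim for acetaminophen overdose risk")
# Unrelated task misses, so the agent would reason from first principles.
miss = store.match("review surgical scheduling backlog")
```

The projected ~85% cost reduction follows from this pattern: as the store fills, most tasks resolve via a cheap match-and-verify path rather than a full reasoning chain.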