Assessing Large Language Models for Structured Medical Order Extraction

📅 2025-10-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Extracting structured medical orders—type, description, reason, and provenance—from multi-turn clinician–patient dialogues remains challenging due to sparse annotations and the cost of domain-specific adaptation. Method: We propose a domain-agnostic, few-shot extraction method built on an instruction-tuned LLaMA-4 17B model, requiring only one high-quality in-context example and carefully engineered prompts, with no domain fine-tuning. Contribution/Results: The approach eliminates reliance on large-scale labeled data and task-specific adaptation while generalizing well in clinical NLP. On the MEDIQA-OE 2025 shared task, it ranked fifth overall (average F1 = 37.76), with particularly strong performance on reason and provenance identification, outperforming most competing systems. These results support prompt-driven large language models as effective, transferable tools for complex, low-resource clinical information extraction.

📝 Abstract
Medical order extraction is essential for structuring actionable clinical information, supporting decision-making, and enabling downstream applications such as documentation and workflow automation. Orders may be embedded in diverse sources, including electronic health records, discharge summaries, and multi-turn doctor-patient dialogues, and can span categories such as medications, laboratory tests, imaging studies, and follow-up actions. The MEDIQA-OE 2025 shared task focuses on extracting structured medical orders from extended conversational transcripts, requiring the identification of order type, description, reason, and provenance. We present the MasonNLP submission, which ranked 5th among 17 participating teams with 105 total submissions. Our approach uses a general-purpose, instruction-tuned LLaMA-4 17B model without domain-specific fine-tuning, guided by a single in-context example. This few-shot configuration achieved an average F1 score of 37.76, with notable improvements in reason and provenance accuracy. These results demonstrate that large, non-domain-specific LLMs, when paired with effective prompt engineering, can serve as strong, scalable baselines for specialized clinical NLP tasks.
Problem

Research questions and friction points this paper is trying to address.

Extracting structured medical orders from clinical conversations
Identifying order types, descriptions, reasons, and provenance
Evaluating general-purpose LLMs for specialized clinical NLP tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses instruction-tuned LLaMA-4 17B model
Employs single in-context example guidance
Leverages few-shot configuration without fine-tuning
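The one-shot setup above can be sketched as a prompt-construction routine. This is a minimal illustration, not the authors' actual prompt: the wording, the example dialogue, and the helper name `build_prompt` are assumptions; only the four target fields (order type, description, reason, provenance) come from the paper.

```python
import json

# Hypothetical worked example embedded in the prompt (not from the paper's data).
EXAMPLE_DIALOGUE = (
    "DOCTOR: Your cough has lingered, so let's get a chest X-ray today.\n"
    "PATIENT: Okay, whatever you think is best."
)
EXAMPLE_ORDERS = [
    {
        "order_type": "imaging",
        "description": "chest X-ray",
        "reason": "lingering cough",
        "provenance": "DOCTOR: Your cough has lingered, so let's get a chest X-ray today.",
    }
]

def build_prompt(transcript: str) -> str:
    """Assemble a one-shot extraction prompt: task instructions, a single
    worked example, then the transcript to annotate."""
    return (
        "Extract every medical order from the dialogue. Return a JSON list; "
        "each order has order_type, description, reason, and provenance.\n\n"
        f"Example dialogue:\n{EXAMPLE_DIALOGUE}\n"
        f"Example output:\n{json.dumps(EXAMPLE_ORDERS, indent=2)}\n\n"
        f"Dialogue:\n{transcript}\nOutput:"
    )

prompt = build_prompt("DOCTOR: I'll start you on lisinopril for your blood pressure.")
print(prompt)
```

The prompt string would then be sent to the instruction-tuned model, and the JSON list in its completion parsed into structured orders; since no fine-tuning is involved, swapping in a new domain only requires replacing the in-context example.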