🤖 AI Summary
This study systematically evaluates the potential of large language models (LLMs) for automated coding of clinical narratives into the International Classification of Primary Care, Second Edition (ICPC-2), aiming to improve the structured capture of primary care data for research, quality assurance, and health policy.
Method: We propose a retrieval-augmented framework: a semantic retriever built on text-embedding-3-large first generates candidate ICPC-2 codes; an LLM without task-specific fine-tuning then selects the best-matching code from this shortlist.
Contribution/Results: We conduct the first zero-shot ICPC-2 coding benchmark across 33 mainstream LLMs, demonstrating that semantic retrieval critically improves performance. Experimental results show that 28 models achieve F1 ≥ 0.8 (10 exceed 0.85), with gpt-4.5-preview attaining top performance. Most models produce syntactically valid outputs with low hallucination rates. This work establishes a reproducible, efficient, and low-barrier technical pathway for automating medical classification in primary care settings.
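The two-stage pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy bag-of-words embedding stands in for text-embedding-3-large, the three labeled concepts are hypothetical examples (not the 73,563-concept index), and the LLM call itself is omitted since any chat-completion API could consume the constructed prompt.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a stand-in for the dense
    text-embedding-3-large vectors used by the actual retriever."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    num = sum(v * b[t] for t, v in a.items())
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, labeled_concepts, k=5):
    """Stage 1: return the k labeled concepts most similar to the query."""
    q = embed(query)
    ranked = sorted(labeled_concepts,
                    key=lambda c: cosine(q, embed(c["text"])),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, candidates):
    """Stage 2: build the prompt asking an LLM to pick one ICPC-2 code
    from the retrieved shortlist (the LLM call itself is omitted)."""
    lines = [f"Clinical expression: {query}", "Candidate ICPC-2 codes:"]
    lines += [f"- {c['code']}: {c['text']}" for c in candidates]
    lines.append("Answer with exactly one code.")
    return "\n".join(lines)

# Hypothetical mini-index of labeled concepts.
concepts = [
    {"code": "R05", "text": "cough"},
    {"code": "K86", "text": "hypertension uncomplicated"},
    {"code": "A03", "text": "fever"},
]
shortlist = retrieve("dry cough at night", concepts, k=2)
prompt = build_prompt("dry cough at night", shortlist)
```

Constraining the LLM to a retrieved shortlist, rather than asking it to generate a code freely, is what keeps outputs syntactically valid and hallucination rates low.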
📝 Abstract
Background: Medical coding structures healthcare data for research, quality monitoring, and policy. This study assesses the potential of large language models (LLMs) to assign ICPC-2 codes using the output of a domain-specific search engine.
Methods: A dataset of 437 Brazilian Portuguese clinical expressions, each annotated with ICPC-2 codes, was used. A semantic search engine (OpenAI's text-embedding-3-large) retrieved candidates from 73,563 labeled concepts. Thirty-three LLMs were prompted with each query and retrieved results to select the best-matching ICPC-2 code. Performance was evaluated using F1-score, along with token usage, cost, response time, and format adherence.
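As a sketch of the evaluation metric, the snippet below computes a micro-averaged F1-score over per-item sets of gold and predicted codes. The exact averaging scheme used in the study is not specified here, and the example code sets are hypothetical.

```python
def f1_score(gold, pred):
    """Micro-averaged F1 over paired gold/predicted ICPC-2 code sets.
    Counts true positives, false positives, and false negatives
    across all items, then combines precision and recall."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # codes both assigned and correct
        fp += len(p - g)   # predicted but not in the gold annotation
        fn += len(g - p)   # gold codes the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Hypothetical annotations: three expressions, gold vs. predicted codes.
gold = [{"R05"}, {"K86"}, {"A03", "A77"}]
pred = [{"R05"}, {"K85"}, {"A03"}]
score = f1_score(gold, pred)  # 2 TP, 1 FP, 2 FN -> F1 = 4/7
```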
Results: Twenty-eight models achieved F1-score > 0.8; ten exceeded 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Optimizing the retriever improved performance by up to 4 F1 points. Most models returned valid codes in the expected format, with low hallucination rates. Smaller models (<3B parameters) struggled with output formatting and input length.
Conclusions: LLMs show strong potential for automating ICPC-2 coding, even without fine-tuning. This work offers a benchmark and highlights open challenges, but its findings are limited by the dataset's scope and the evaluation setup. Broader, multilingual, end-to-end evaluations are needed for clinical validation.