Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care

📅 2025-07-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the potential of large language models (LLMs) for automated coding of clinical narratives into the International Classification of Primary Care, Second Edition (ICPC-2), aiming to improve the structuring of primary care data for research, quality assurance, and health policy. Method: We propose a retrieval-augmented framework: first, a semantic retriever built on text-embedding-3-large generates candidate ICPC-2 codes; then, an LLM without fine-tuning selects the best-matching code from this shortlist. Contribution/Results: We conduct the first zero-shot ICPC-2 coding benchmark across 33 mainstream LLMs, demonstrating that semantic retrieval is critical to performance. Twenty-eight models achieve F1 ≥ 0.8 (ten exceed 0.85), with gpt-4.5-preview attaining top performance. Most models produce syntactically valid outputs with low hallucination rates. This work establishes a reproducible, efficient, and low-barrier technical pathway for automating medical classification in primary care settings.
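The retrieve-then-select pipeline described above can be sketched in a few lines. This is not the paper's implementation: the codes, descriptions, and toy vectors below are illustrative stand-ins for text-embedding-3-large outputs over the 73,563 labeled concepts, and the prompt wording is an assumption.

```python
# Minimal sketch of a retrieve-then-select coding pipeline.
# Toy embeddings stand in for text-embedding-3-large outputs;
# codes and descriptions are illustrative, not the paper's data.
import math

# Hypothetical labeled concepts: ICPC-2 code -> (description, toy embedding)
CONCEPTS = {
    "R74": ("upper respiratory infection, acute", [0.9, 0.1, 0.0]),
    "R05": ("cough", [0.7, 0.6, 0.1]),
    "D73": ("gastroenteritis, presumed infection", [0.0, 0.2, 0.9]),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, k=2):
    """Rank all concepts by cosine similarity; return top-k candidate codes."""
    scored = sorted(CONCEPTS.items(),
                    key=lambda kv: cosine(query_vec, kv[1][1]),
                    reverse=True)
    return [code for code, _ in scored[:k]]

def build_prompt(query_text, candidates):
    """Prompt asking an LLM to pick exactly one code from the shortlist."""
    lines = [f"Clinical expression: {query_text}", "Candidate ICPC-2 codes:"]
    lines += [f"- {c}: {CONCEPTS[c][0]}" for c in candidates]
    lines.append("Answer with exactly one code from the list above.")
    return "\n".join(lines)

# Toy query embedding for a "dry cough" expression
query_vec = [0.72, 0.58, 0.05]
shortlist = retrieve(query_vec)            # -> ["R05", "R74"]
prompt = build_prompt("dry cough", shortlist)
```

The key design point the paper highlights is that the LLM only chooses among retrieved candidates, which keeps outputs within the valid code space and reduces hallucination.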

📝 Abstract
Background: Medical coding structures healthcare data for research, quality monitoring, and policy. This study assesses the potential of large language models (LLMs) to assign ICPC-2 codes using the output of a domain-specific search engine. Methods: A dataset of 437 Brazilian Portuguese clinical expressions, each annotated with ICPC-2 codes, was used. A semantic search engine (OpenAI's text-embedding-3-large) retrieved candidates from 73,563 labeled concepts. Thirty-three LLMs were prompted with each query and retrieved results to select the best-matching ICPC-2 code. Performance was evaluated using F1-score, along with token usage, cost, response time, and format adherence. Results: Twenty-eight models achieved F1-score > 0.8; ten exceeded 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Retriever optimization can improve performance by up to 4 points. Most models returned valid codes in the expected format, with reduced hallucinations. Smaller models (<3B) struggled with formatting and input length. Conclusions: LLMs show strong potential for automating ICPC-2 coding, even without fine-tuning. This work offers a benchmark and highlights challenges, but findings are limited by dataset scope and setup. Broader, multilingual, end-to-end evaluations are needed for clinical validation.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs for ICPC-2 medical code assignment
Evaluating performance of 33 LLMs in coding clinical expressions
Benchmarking LLMs for healthcare data automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs automate ICPC-2 coding without fine-tuning
Semantic search engine retrieves candidate codes
Benchmark evaluates performance, cost, and format adherence
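The evaluation criteria above (F1-score plus format adherence) can be illustrated with a small scoring routine. This is an assumption-laden sketch, not the paper's evaluation script: the regex for a valid ICPC-2 code (one letter plus two digits) and the choice to treat malformed outputs as abstentions are illustrative.

```python
# Illustrative scoring: validate output format, then compute an F1-style
# score over single-code predictions. Data below is made up.
import re

CODE_RE = re.compile(r"^[A-Z]\d{2}$")  # assumed ICPC-2 code shape

def evaluate(predictions, gold):
    """Treat malformed outputs as abstentions; score valid ones exactly."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, gold):
        if not CODE_RE.match(pred):   # format violation / hallucinated text
            fn += 1
            continue
        if pred == ref:
            tp += 1
        else:
            fp += 1
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

preds = ["R05", "R74", "cough???", "D73"]  # third output violates format
gold  = ["R05", "R74", "R05", "A03"]
p, r, f1 = evaluate(preds, gold)
```

Separating format violations from wrong-but-valid codes mirrors the paper's distinction between format adherence and coding accuracy.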
🔎 Similar Papers
Vinicius Anjos de Almeida
Medical School, University of São Paulo, São Paulo, Brazil
Vinicius de Camargo
Department of Epidemiology, School of Public Health, University of São Paulo, São Paulo, Brazil
Raquel Gómez-Bravo
Rehaklinik, Centre Hospitalier Neuro-psychiatrique (CHNP), Ettelbruck, Luxembourg
Kees van Boven
Department of Primary and Community Care, Radboud University, Nijmegen, Netherlands
Egbert van der Haring
Independent researcher, Netherlands
Marcelo Finger
Professor of Computer Science, Universidade de São Paulo
Artificial Intelligence · Computational Logic · Natural Language Processing
Luis Fernandez Lopez
Medical School, University of São Paulo, São Paulo, Brazil