🤖 AI Summary
This study systematically evaluates the potential of large language models (LLMs) for automated coding of clinical narratives into the International Classification of Primary Care, Second Edition (ICPC-2), aiming to improve the structured capture of primary care data for research, quality assurance, and health policy.
Method: We propose a retrieval-augmented framework: a semantic retriever built on text-embedding-3-large first generates candidate ICPC-2 codes; an LLM without task-specific fine-tuning then selects the best-matching code from this shortlist.
Contribution/Results: We conduct the first zero-shot ICPC-2 coding benchmark across 33 mainstream LLMs, demonstrating that semantic retrieval critically improves performance. Experimental results show that 28 models achieve F1 ≥ 0.8 (10 exceed 0.85), with gpt-4.5-preview attaining top performance. Most models produce syntactically valid outputs with low hallucination rates. This work establishes a reproducible, efficient, and low-barrier technical pathway for automating medical classification in primary care settings.
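The two-stage pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy bag-of-words embedding stands in for text-embedding-3-large, the three labeled concepts are hypothetical examples (not the 73,563-concept index), and the LLM call itself is omitted since any chat-completion API could consume the constructed prompt.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a stand-in for the dense
    text-embedding-3-large vectors used by the actual retriever."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    num = sum(v * b[t] for t, v in a.items())
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, labeled_concepts, k=5):
    """Stage 1: return the k labeled concepts most similar to the query."""
    q = embed(query)
    ranked = sorted(labeled_concepts,
                    key=lambda c: cosine(q, embed(c["text"])),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, candidates):
    """Stage 2: build the prompt asking an LLM to pick one ICPC-2 code
    from the retrieved shortlist (the LLM call itself is omitted)."""
    lines = [f"Clinical expression: {query}", "Candidate ICPC-2 codes:"]
    lines += [f"- {c['code']}: {c['text']}" for c in candidates]
    lines.append("Answer with exactly one code.")
    return "\n".join(lines)

# Hypothetical mini-index of labeled concepts.
concepts = [
    {"code": "R05", "text": "cough"},
    {"code": "K86", "text": "hypertension uncomplicated"},
    {"code": "A03", "text": "fever"},
]
shortlist = retrieve("dry cough at night", concepts, k=2)
prompt = build_prompt("dry cough at night", shortlist)
```

Constraining the LLM to a retrieved shortlist, rather than asking it to generate a code freely, is what keeps outputs syntactically valid and hallucination rates low.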
📝 Abstract
Background: Medical coding structures healthcare data for research, quality monitoring, and policy. This study assesses the potential of large language models (LLMs) to assign ICPC-2 codes using the output of a domain-specific search engine.
Methods: A dataset of 437 Brazilian Portuguese clinical expressions, each annotated with ICPC-2 codes, was used. A semantic search engine (OpenAI's text-embedding-3-large) retrieved candidates from 73,563 labeled concepts. Thirty-three LLMs were prompted with each query and retrieved results to select the best-matching ICPC-2 code. Performance was evaluated using F1-score, along with token usage, cost, response time, and format adherence.
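As a sketch of the evaluation metric, the snippet below computes a micro-averaged F1-score over per-item sets of gold and predicted codes. The exact averaging scheme used in the study is not specified here, and the example code sets are hypothetical.

```python
def f1_score(gold, pred):
    """Micro-averaged F1 over paired gold/predicted ICPC-2 code sets.
    Counts true positives, false positives, and false negatives
    across all items, then combines precision and recall."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # codes both assigned and correct
        fp += len(p - g)   # predicted but not in the gold annotation
        fn += len(g - p)   # gold codes the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Hypothetical annotations: three expressions, gold vs. predicted codes.
gold = [{"R05"}, {"K86"}, {"A03", "A77"}]
pred = [{"R05"}, {"K85"}, {"A03"}]
score = f1_score(gold, pred)  # 2 TP, 1 FP, 2 FN -> F1 = 4/7
```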
Results: Twenty-eight models achieved F1-score > 0.8; ten exceeded 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Optimizing the retriever improved performance by up to 4 F1 points. Most models returned valid codes in the expected format, with low hallucination rates. Smaller models (<3B parameters) struggled with output formatting and input length.
Conclusions: LLMs show strong potential for automating ICPC-2 coding, even without fine-tuning. This work offers a benchmark and highlights open challenges, but its findings are limited by the dataset's scope and the evaluation setup. Broader, multilingual, end-to-end evaluations are needed for clinical validation.