Teaching LLMs Brazilian Healthcare: Injecting Knowledge from Official Clinical Guidelines

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses two gaps: existing large language models perform poorly on the knowledge contained in Brazil's official clinical guidelines, and no benchmark evaluates that knowledge in Brazilian Portuguese. The authors build a synthetic dataset of approximately 70 million tokens from 178 official clinical guidelines, along with two new evaluation benchmarks, HealthBench-BR and PCDT-QA. The synthetic data comes in three formats (question-answer pairs, rephrased texts, and wiki-style articles) and is produced by four generator LLMs. Qwen2.5-14B-Instruct is then adapted with continual pre-training followed by Group Relative Policy Optimization (GRPO). The best resulting model scores 83.9% on HealthBench-BR and 85.4% on PCDT-QA, outperforming GPT-5.2, Claude Sonnet 4.6, Gemini 3.1 Pro, and Google AI Overview's web-grounded RAG despite having only 14B parameters. Ablations indicate that generator diversity and reinforcement learning are both critical to these gains. All datasets, benchmarks, and model weights are publicly released.

📝 Abstract

Brazil's Unified Health System (SUS) relies on official clinical guidelines that define diagnostic criteria, treatments, dosages, and monitoring procedures for over 200 million citizens. Yet current LLMs perform poorly on this guideline-specific knowledge, and no benchmark evaluates clinical recall grounded in Brazilian Portuguese protocols. We address this gap by adapting Qwen2.5-14B-Instruct to the Brazilian clinical domain. From 178 official guidelines (~5.4M tokens), we generate ~70M tokens of synthetic data in three formats -- rephrases, wiki-style articles, and question-answer pairs -- using four generator LLMs. We then apply continual pre-training followed by Group Relative Policy Optimization (GRPO). We introduce HealthBench-BR, with 1,780 balanced true/false clinical assertions, and PCDT-QA, with 890 open-ended clinical questions scored by an LLM judge. Our best model achieves 83.9% on HealthBench-BR and 85.4% on PCDT-QA, outperforming GPT-5.2, Claude Sonnet 4.6, Gemini 3.1 Pro, and Google AI Overview's web-grounded RAG despite having only 14B parameters. Ablations show that generator diversity and reinforcement learning are critical to these gains. We release all datasets, benchmarks, and model weights to support reproducible clinical NLP research for Brazilian Portuguese. Code, data, and model weights are available at https://github.com/hugoabonizio/clinical-protocols-br
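The abstract names GRPO but does not describe the reward or the training setup. As a rough illustration of the idea behind the method only, the sketch below computes group-relative advantages: several answers are sampled for the same prompt, each is scored, and each score is normalized against the mean and standard deviation of its own group. The binary true/false reward, the group of four samples, and the function names `true_false_reward` and `group_relative_advantages` are assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def true_false_reward(predicted: bool, gold: bool) -> float:
    """Hypothetical reward: 1.0 for a correct true/false verdict, else 0.0."""
    return 1.0 if predicted == gold else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and std of its own group.

    Uses the population standard deviation; implementations differ on
    this detail, so treat the choice as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled verdicts for one assertion whose gold label is True.
gold_label = True
sampled_verdicts = [True, False, True, True]

rewards = [true_false_reward(v, gold_label) for v in sampled_verdicts]
advantages = group_relative_advantages(rewards)

for verdict, adv in zip(sampled_verdicts, advantages):
    print(f"verdict={verdict!s:5}  advantage={adv:+.3f}")
```

Correct verdicts receive a positive advantage and the incorrect one a negative advantage, so the policy update favors answers that beat the group average. When every sample in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.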

Problem

Research questions and friction points this paper is trying to address.

Brazilian clinical guidelines

large language models

clinical knowledge

benchmark

Brazilian Portuguese

Innovation

Methods, ideas, or system contributions that make the work stand out.

clinical guidelines

synthetic data generation

Group Relative Policy Optimization