Artificial Intelligence health advice accuracy varies across languages and contexts

📅 2025-04-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Evaluating the accuracy of large language models (LLMs) in delivering evidence-based health advice across diverse languages and domains remains a critical challenge, particularly for sensitive public health topics. Method: We systematically assessed six state-of-the-art LLMs on 9,100 journalist-verified health claims spanning 21 languages and nine thematic categories (e.g., abortion, COVID-19, politics), sourced from peer-reviewed journals, governmental guidelines, social media, and news outlets. We introduced the first global empirical benchmark for health AI advice—incorporating multilingualism, multimodality of sources, and domain heterogeneity—and proposed a domain-aware, language-fair verification framework. Contribution/Results: While models achieved high accuracy on English claims, performance dropped by 27% on average for non-European languages; consistency deteriorated markedly on sensitive topics (e.g., abortion) and non-official sources. Our findings underscore the necessity of mandatory multilingual and domain-adapted validation, providing both methodological foundations and empirical evidence to guide responsible global deployment of health AI systems.
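The reported 27% average drop for non-European languages is an aggregate of per-language accuracies compared against English. As a minimal sketch of how such a language-fair scoring could be computed (the function names and the toy data below are hypothetical illustrations, not the authors' code or figures):

```python
from collections import defaultdict

def language_accuracy(results):
    """Aggregate per-language accuracy from (language, correct) records.

    `results` is a list of (language, bool) pairs, where the bool marks
    whether the model's verdict matched the journalist-verified label.
    """
    totals = defaultdict(lambda: [0, 0])  # language -> [correct, total]
    for lang, correct in results:
        totals[lang][0] += int(correct)
        totals[lang][1] += 1
    return {lang: c / n for lang, (c, n) in totals.items()}

def drop_vs_english(acc):
    """Relative accuracy drop of each language against the English baseline."""
    base = acc["en"]
    return {lang: (base - a) / base for lang, a in acc.items() if lang != "en"}

# Toy illustration with fabricated numbers (not the paper's data):
records = ([("en", True)] * 9 + [("en", False)]
           + [("sw", True)] * 7 + [("sw", False)] * 3)
acc = language_accuracy(records)      # {'en': 0.9, 'sw': 0.7}
drops = drop_vs_english(acc)          # {'sw': ~0.222}
```

Normalizing each language's drop by the English baseline, rather than reporting raw accuracies, keeps the comparison fair when the English ceiling itself varies across models or topics.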

📝 Abstract
Using basic health statements authorized by UK and EU registers and 9,100 journalist-vetted public-health assertions on topics such as abortion, COVID-19, and politics, drawn from sources ranging from peer-reviewed journals and government advisories to social media and news outlets across the political spectrum, we benchmark six leading large language models in 21 languages. We find that, despite high accuracy on English-centric textbook claims, performance falls in multiple non-European languages and fluctuates by topic and source, highlighting the urgency of comprehensive multilingual, domain-aware validation before deploying AI in global health communication.
Problem

Research questions and friction points this paper is trying to address.

Assess AI health advice accuracy across languages and contexts
Benchmark large language models in 21 languages for health claims
Identify performance gaps in non-European languages and topics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking six large language models across 21 languages
Combining register-authorized health statements with journalist-vetted assertions
Demonstrating the urgency of multilingual, domain-aware validation