🤖 AI Summary
Evaluating the accuracy of large language models (LLMs) in delivering evidence-based health advice across diverse languages and domains remains a critical challenge, particularly for sensitive public-health topics. Method: We systematically assessed six state-of-the-art LLMs on 9,100 journalist-verified health claims spanning 21 languages and nine thematic categories (e.g., abortion, COVID-19, politics), sourced from peer-reviewed journals, governmental guidelines, social media, and news outlets. We introduced the first global empirical benchmark for health AI advice, one that spans multiple languages, source types, and domains, and proposed a domain-aware, language-fair verification framework. Contribution/Results: While the models achieved high accuracy on English claims, performance dropped by 27% on average for non-European languages, and consistency deteriorated markedly on sensitive topics (e.g., abortion) and on claims from non-official sources. Our findings underscore the need for mandatory multilingual, domain-adapted validation, providing both methodological foundations and empirical evidence to guide the responsible global deployment of health AI systems.
📝 Abstract
We benchmark six leading large language models in 21 languages, using basic health statements authorized by UK and EU registers together with 9,100 journalist-vetted public-health claims on topics such as abortion, COVID-19, and politics, drawn from sources ranging from peer-reviewed journals and government advisories to social media and news outlets across the political spectrum. Despite high accuracy on English-centric textbook claims, performance falls in multiple non-European languages and fluctuates by topic and source, underscoring the urgency of comprehensive multilingual, domain-aware validation before AI is deployed in global health communication.
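The paper does not publish its evaluation code here, but the kind of per-language and per-topic accuracy breakdown the abstract describes can be sketched in a few lines. All field names (`language`, `topic`, `correct`) and the toy records below are illustrative assumptions, not the paper's actual data schema:

```python
from collections import defaultdict

def accuracy_by_group(records, key):
    """Aggregate model correctness over a grouping key (e.g. "language" or "topic").

    Each record is a dict such as {"language": "en", "topic": "covid", "correct": True},
    where "correct" marks whether the model's verdict matched the journalist-verified label.
    """
    totals = defaultdict(int)  # number of claims per group
    hits = defaultdict(int)    # number of correctly judged claims per group
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {group: hits[group] / totals[group] for group in totals}

# Toy example: two languages, uneven performance
records = [
    {"language": "en", "topic": "covid", "correct": True},
    {"language": "en", "topic": "abortion", "correct": True},
    {"language": "sw", "topic": "covid", "correct": False},
    {"language": "sw", "topic": "covid", "correct": True},
]
print(accuracy_by_group(records, "language"))  # {'en': 1.0, 'sw': 0.5}
```

Comparing such per-group accuracies (rather than a single pooled score) is what surfaces the gaps the authors report between English and non-European languages, and between official and non-official sources.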