Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluations in healthcare predominantly rely on generic or domain-specific benchmarks detached from real-world clinical and community contexts, overlooking local community needs, cultural nuances, and everyday health practices. Method: We introduce Samiksha—the first community-driven, scalable evaluation framework for LLMs in healthcare—co-developed with grassroots organizations and community members across diverse Indian regions. Integrating qualitative social research with automated testing, Samiksha features multilingual interfaces, iterative feedback loops, and a culturally grounded scoring system. Contribution/Results: Its core innovation lies in centering cultural sensitivity and dynamic community input across the entire evaluation lifecycle—content generation, dataset construction, and result interpretation—enabling participatory governance. Empirical evaluation reveals significant limitations of state-of-the-art multilingual LLMs on complex, locally grounded health consultation tasks. Samiksha establishes a reproducible, scalable paradigm for community-embedded LLM assessment.

📝 Abstract
Large Language Models (LLMs) are typically evaluated through general or domain-specific benchmarks testing capabilities that often lack grounding in the lived realities of end users. Critical domains such as healthcare require evaluations that extend beyond artificial or simulated tasks to reflect the everyday needs, cultural practices, and nuanced contexts of communities. We propose Samiksha, a community-driven evaluation pipeline co-created with civil-society organizations (CSOs) and community members. Our approach enables scalable, automated benchmarking through a culturally aware, community-driven pipeline in which community feedback informs what to evaluate, how the benchmark is built, and how outputs are scored. We demonstrate this approach in the health domain in India. Our analysis highlights how current multilingual LLMs address nuanced community health queries, while also offering a scalable pathway for contextually grounded and inclusive LLM evaluation.
Problem

Research questions and friction points this paper is trying to address.

- Current LLM evaluations lack grounding in end users' real-world healthcare contexts
- Healthcare requires culturally aware evaluation beyond artificial, simulated tasks
- Existing benchmarks fail to reflect community needs and cultural practices
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Community-driven evaluation pipeline co-created with civil-society organizations and community members
- Culturally aware, automated benchmarking informed by community feedback
- Scalable method for contextually grounded LLM assessment
Hamna
Microsoft Research India, Bengaluru, India
Gayatri Bhat
Karya, Bengaluru, India
Sourabrata Mukherjee
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
NLG, LLM, NLP, DL, ML
Faisal Lalani
Collective Intelligence Project, New York, US
Evan Hadfield
Collective Intelligence Project, New York, US
Divya Siddarth
Collective Intelligence Project, New York, US
Kalika Bali
Microsoft Research India, Bengaluru, India
Sunayana Sitaram
Microsoft Research India
Multilingual NLP, evaluation, LLMs and culture, multilingualism, LLMs