Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluations in healthcare predominantly rely on generic or domain-specific benchmarks detached from real-world clinical and community contexts, overlooking local community needs, cultural nuances, and everyday health practices. Method: We introduce Samiksha—the first community-driven, scalable evaluation framework for LLMs in healthcare—co-developed with grassroots organizations and community members across diverse Indian regions. Integrating qualitative social research with automated testing, Samiksha features multilingual interfaces, iterative feedback loops, and a culturally grounded scoring system. Contribution/Results: Its core innovation lies in centering cultural sensitivity and dynamic community input across the entire evaluation lifecycle—content generation, dataset construction, and result interpretation—enabling participatory governance. Empirical evaluation reveals significant limitations of state-of-the-art multilingual LLMs on complex, locally grounded health consultation tasks. Samiksha establishes a reproducible, scalable paradigm for community-embedded LLM assessment.

📝 Abstract
Large Language Models (LLMs) are typically evaluated through general or domain-specific benchmarks testing capabilities that often lack grounding in the lived realities of end users. Critical domains such as healthcare require evaluations that extend beyond artificial or simulated tasks to reflect the everyday needs, cultural practices, and nuanced contexts of communities. We propose Samiksha, a community-driven evaluation pipeline co-created with civil-society organizations (CSOs) and community members. Our approach enables scalable, automated benchmarking through a culturally aware, community-driven pipeline in which community feedback informs what to evaluate, how the benchmark is built, and how outputs are scored. We demonstrate this approach in the health domain in India. Our analysis highlights how current multilingual LLMs address nuanced community health queries, while also offering a scalable pathway for contextually grounded and inclusive LLM evaluation.
Problem

Research questions and friction points this paper is trying to address.

- Current LLM evaluations lack grounding in end users' real-world healthcare contexts
- Healthcare requires culturally aware evaluation beyond artificial, simulated tasks
- Existing benchmarks fail to reflect community needs and cultural practices
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Community-driven evaluation pipeline co-created with civil-society organizations and community members
- Culturally aware, automated benchmarking informed by community feedback
- Scalable method for contextually grounded LLM assessment
Hamna
Microsoft Research India, Bengaluru, India
Gayatri Bhat
Karya, Bengaluru, India
Sourabrata Mukherjee
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
NLG, LLM, NLP, DL, ML
Faisal Lalani
Collective Intelligence Project, New York, US
Evan Hadfield
Collective Intelligence Project, New York, US
Divya Siddarth
Collective Intelligence Project, New York, US
Kalika Bali
Microsoft Research India, Bengaluru, India
Sunayana Sitaram
Microsoft Research India
Multilingual NLP, evaluation, LLMs and culture, multilingualism, LLMs