🤖 AI Summary
Existing LLM evaluations in healthcare predominantly rely on generic or domain-specific benchmarks detached from real-world clinical and community contexts, overlooking local needs, cultural nuances, and everyday health practices.
Method: We introduce Samiksha, the first community-driven, scalable evaluation framework for LLMs in healthcare, co-developed with grassroots organizations and community members across diverse Indian regions. Integrating qualitative social research with automated testing, Samiksha features multilingual interfaces, iterative feedback loops, and a culturally grounded scoring system.
Contribution/Results: Its core innovation lies in centering cultural sensitivity and dynamic community input across the entire evaluation lifecycle (content generation, dataset construction, and result interpretation), enabling participatory governance. Empirical evaluation reveals significant limitations of state-of-the-art multilingual LLMs on complex, locally grounded health consultation tasks. Samiksha establishes a reproducible, scalable paradigm for community-embedded LLM assessment.
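A culturally grounded scoring system with an iterative feedback loop, as summarized above, could be sketched as follows. This is a minimal illustration, not the paper's implementation: the rubric dimensions, weights, rating scale, and function names are all hypothetical assumptions.

```python
from dataclasses import dataclass

# Hypothetical rubric for a culturally grounded score; the paper's
# actual evaluation criteria and weights are not specified here.
RUBRIC = {
    "medical_accuracy": 0.4,
    "cultural_appropriateness": 0.3,
    "language_accessibility": 0.2,
    "actionability": 0.1,
}

@dataclass
class Rating:
    """Per-dimension ratings (0-5) assigned by a community reviewer."""
    scores: dict  # dimension name -> rating in [0, 5]

def weighted_score(rating: Rating, rubric: dict = RUBRIC) -> float:
    """Combine per-dimension community ratings into one score in [0, 1]."""
    total = sum(rubric[d] * rating.scores[d] for d in rubric)
    return total / 5.0  # normalize away the 0-5 rating scale

def update_weights(rubric: dict, feedback: dict) -> dict:
    """Illustrative feedback loop: nudge rubric weights toward the
    importance community members assign each dimension, then renormalize
    so the weights still sum to 1."""
    adjusted = {d: rubric[d] + 0.1 * feedback.get(d, 0.0) for d in rubric}
    norm = sum(adjusted.values())
    return {d: w / norm for d, w in adjusted.items()}
```

The design intent of a sketch like this is that the community, not the benchmark authors alone, controls both the rating dimensions and their relative weights over successive evaluation rounds.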
📝 Abstract
Large Language Models (LLMs) are typically evaluated through general or domain-specific benchmarks testing capabilities that often lack grounding in the lived realities of end users. Critical domains such as healthcare require evaluations that extend beyond artificial or simulated tasks to reflect the everyday needs, cultural practices, and nuanced contexts of communities. We propose Samiksha, a community-driven evaluation pipeline co-created with civil-society organizations (CSOs) and community members. Our approach enables scalable, automated benchmarking through a culturally aware, community-driven pipeline in which community feedback informs what to evaluate, how the benchmark is built, and how outputs are scored. We demonstrate this approach in the health domain in India. Our analysis highlights how well current multilingual LLMs handle nuanced community health queries, while also offering a scalable pathway for contextually grounded and inclusive LLM evaluation.