🤖 AI Summary
This study evaluates the performance of mainstream large language models (LLMs) on consumer-grade health question answering, focusing on the real-world, colloquial, and often ambiguous health queries posed by non-expert users (the MedRedQA dataset). The authors propose a cross-model comparative evaluation framework in which each model generates answers zero-shot and then judges its own and the other models' responses, reducing reliance on human annotation and its attendant bias. Five models are compared: GPT-4o mini, Llama 3.1 70B, Mistral-123B, Mistral-7B, and Gemini-Flash. Results show that GPT-4o mini achieves the highest overall performance, most closely approximating clinical expert responses; Mistral-7B lags significantly; and all models exhibit systematic limitations in response accuracy, explanatory transparency, and handling of informal language. To the authors' knowledge, this is the first work to systematically characterize the capability boundaries and key deficiencies of LLMs on authentic consumer medical question answering, establishing a methodological foundation and providing empirical evidence for the development of trustworthy health AI systems.
📝 Abstract
This study evaluates the performance of several Large Language Models (LLMs) on MedRedQA, a dataset of consumer medical questions with answers from verified experts, extracted from the AskDocs subreddit. While LLMs have shown proficiency in clinical question answering (QA) benchmarks, their effectiveness on real-world, consumer-based medical questions remains less understood. MedRedQA presents unique challenges, such as informal language and the need for precise responses suited to non-specialist queries. To assess model performance, responses were generated using five LLMs: GPT-4o mini, Llama 3.1 70B, Mistral-123B, Mistral-7B, and Gemini-Flash. A cross-evaluation method was used, in which each model evaluated its own responses as well as those of the other models, to minimize bias. The results indicated that GPT-4o mini achieved the highest alignment with expert responses according to four of the five model judges, while Mistral-7B scored lowest according to three of the five. This study highlights the potential and limitations of current LLMs for consumer health question answering, indicating avenues for further development.
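The cross-evaluation described above can be thought of as a judge-by-generator score matrix that is aggregated into per-judge rankings. Below is a minimal sketch of that aggregation, assuming each judge assigns a numeric alignment score to every generator's responses; the model names are from the paper, but the scoring scale, function names, and toy numbers are illustrative, not the paper's actual protocol or results.

```python
# Hypothetical cross-evaluation aggregation: scores[judge][generator] holds the
# average alignment score judge assigned to generator's responses (toy 1-5 scale).
MODELS = ["gpt-4o-mini", "llama-3.1-70b", "mistral-123b", "mistral-7b", "gemini-flash"]

def rank_by_judge(scores: dict[str, dict[str, float]]) -> dict[str, list[str]]:
    """For each judge, rank the generators from highest- to lowest-scored."""
    return {
        judge: sorted(per_gen, key=per_gen.get, reverse=True)
        for judge, per_gen in scores.items()
    }

def top_pick_counts(rankings: dict[str, list[str]]) -> dict[str, int]:
    """Count how many judges ranked each generator first overall."""
    counts = {model: 0 for model in MODELS}
    for ranking in rankings.values():
        counts[ranking[0]] += 1
    return counts

# Illustrative scores only: every judge rates gpt-4o-mini slightly higher
# and mistral-7b slightly lower than the rest, mirroring the reported trend.
toy_scores = {
    judge: {
        gen: 3.0
        + (0.5 if gen == "gpt-4o-mini" else 0.0)
        - (0.5 if gen == "mistral-7b" else 0.0)
        for gen in MODELS
    }
    for judge in MODELS
}

rankings = rank_by_judge(toy_scores)
counts = top_pick_counts(rankings)
```

With the toy scores, every judge ranks `gpt-4o-mini` first and `mistral-7b` last; in the actual study, agreement among judges was only partial (four of five on the top model, three of five on the bottom), which is exactly why aggregating across multiple judges matters.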