🤖 AI Summary
Existing LLM safety evaluations focus on generic risks, overlooking context-dependent, individualized harms in high-stakes domains (e.g., finance, healthcare), and lack assessment frameworks centered on user welfare.
Method: We propose a multi-user-profile–driven, context-aware evaluation paradigm: (1) constructing a vulnerability-tiered framework grounded in real-world user profiles; (2) integrating expert annotation with controllable prompt engineering; and (3) conducting empirical evaluation across GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro.
Contribution/Results: We first show that "context-blind" evaluation significantly overestimates safety: for vulnerable users, safety scores drop sharply from 5/7 to 3/7, exposing critical individual-level disparities. We further demonstrate that relying solely on user-initiated context disclosure is insufficient for robust risk detection, and that generic risk taxonomies cannot substitute for personalized assessment. To support future work, we publicly release our dataset and codebase.
📝 Abstract
Safety evaluations of large language models (LLMs) typically focus on universal risks like dangerous capabilities or undesirable propensities. However, millions use LLMs for personal advice on high-stakes topics like finance and health, where harms are context-dependent rather than universal. While frameworks like the OECD's AI classification recognize the need to assess individual risks, user-welfare safety evaluations remain underdeveloped. We argue that developing such evaluations is non-trivial due to fundamental questions about accounting for user context in evaluation design. In this exploratory study, we evaluated advice on finance and health from GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across user profiles of varying vulnerability. First, we demonstrate that evaluators must have access to rich user context: identical LLM responses were rated significantly safer by context-blind evaluators than by those aware of user circumstances, with safety scores for high-vulnerability users dropping from safe (5/7) to somewhat unsafe (3/7). One might assume this gap could be addressed by creating realistic user prompts containing key contextual information. However, our second study challenges this: we reran the evaluation on prompts containing the context users report they would disclose, finding no significant improvement. Our work establishes that effective user-welfare safety evaluation requires evaluators to assess responses against diverse user profiles, as realistic user context disclosure alone proves insufficient, particularly for vulnerable populations. By demonstrating a methodology for context-aware evaluation, this study provides both a starting point for such assessments and foundational evidence that evaluating individual welfare demands approaches distinct from existing universal-risk frameworks. We publish our code and dataset to aid future work.