🤖 AI Summary
Existing LLM safety evaluations focus on generic risks, overlooking context-dependent, individualized harms in high-stakes domains (e.g., finance, healthcare), and lack assessment frameworks centered on user welfare.
Method: We propose a multi-user-profile–driven, context-aware evaluation paradigm: (1) constructing a vulnerability-tiered framework grounded in real-world user profiles; (2) integrating expert annotation with controllable prompt engineering; and (3) conducting empirical evaluation across GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro.
Contribution/Results: We first show that "context-blind" evaluation significantly overestimates safety: for vulnerable users, safety scores drop sharply from 5/7 to 3/7, exposing critical individual-level disparities. We further demonstrate that relying solely on user-initiated context disclosure is insufficient for robust risk detection, and that generic risk taxonomies cannot substitute for personalized assessment. To support future work, we publicly release our dataset and codebase.
📝 Abstract
Safety evaluations of large language models (LLMs) typically focus on universal risks like dangerous capabilities or undesirable propensities. However, millions use LLMs for personal advice on high-stakes topics like finance and health, where harms are context-dependent rather than universal. While frameworks like the OECD's AI classification recognize the need to assess individual risks, user-welfare safety evaluations remain underdeveloped. We argue that developing such evaluations is non-trivial due to fundamental questions about accounting for user context in evaluation design. In this exploratory study, we evaluated advice on finance and health from GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across user profiles of varying vulnerability. First, we demonstrate that evaluators must have access to rich user context: identical LLM responses were rated significantly safer by context-blind evaluators than by those aware of user circumstances, with safety scores for high-vulnerability users dropping from safe (5/7) to somewhat unsafe (3/7). One might assume this gap could be addressed by creating realistic user prompts containing key contextual information. However, our second study challenges this: we reran the evaluation on prompts containing the context users report they would disclose, finding no significant improvement. Our work establishes that effective user-welfare safety evaluation requires evaluators to assess responses against diverse user profiles, as realistic user context disclosure alone proves insufficient, particularly for vulnerable populations. By demonstrating a methodology for context-aware evaluation, this study provides both a starting point for such assessments and foundational evidence that evaluating individual welfare demands approaches distinct from existing universal-risk frameworks. We publish our code and dataset to aid future work.