Re-Centering Humans in LLM Personalization

📅 2026-06-04

🤖 AI Summary

Current evaluations of personalization in large language models rely predominantly on synthetic data, which often fails to capture real-world user scenarios. This work introduces a personalization evaluation benchmark grounded in 550 authentic human conversations and multi-stage human annotations, and uses it to assess model performance across three stages: attribute extraction, relevance matching, and personalized response generation. The study finds that models diverge from human judgments at every stage, and that users do not rate personalized responses as better than generic ones. To address this, the authors propose two lightweight training-based interventions that bring automated evaluation closer to human judgments in the first two stages; in the third stage, however, learned reward models achieve only modest correlation with human ratings.

📝 Abstract

Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance when using synthetic versus human data. We collect 550 human conversations and human judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919 judgments), and incorporating relevant attributes into a personalized response (1,101 judgments). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments on which attributes are relevant, and generate personalized responses that humans judge to be no better than generic responses (though LLM judges widely rate them as better). We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in our first two stages. In our third stage, however, we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned judgments of personalization quality are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.
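To make the third-stage finding concrete, the sketch below shows one common way to measure how well reward model scores track human ratings: Spearman rank correlation. The abstract does not say which correlation measure the authors use, so the choice of Spearman is an assumption, and the function names and toy numbers are purely illustrative and are not data from the paper.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical toy values, invented for illustration only.
human_ratings = [1, 2, 3, 4, 5]
reward_scores = [0.2, 0.1, 0.5, 0.4, 0.9]
print(round(spearman(human_ratings, reward_scores), 3))
```

A value near 1 would mean the reward model orders responses the way human raters do; the abstract reports only that the observed correlation is modest.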

Problem

Research questions and friction points this paper is trying to address.

LLM personalization

human evaluation

synthetic data

user attributes

personalized response

Innovation

Methods, ideas, or system contributions that make the work stand out.

human-centered personalization

large language models

synthetic vs human data

personalization evaluation

reward modeling
