🤖 AI Summary
Current evaluations of personalization in large language models rely predominantly on synthetic data, which often fails to capture real-world user scenarios. This work introduces a personalization evaluation benchmark grounded in 550 authentic human conversations and multi-stage human annotations, assessing model performance across three stages: attribute extraction, relevance matching, and personalized response generation. The study finds that models diverge from human judgments at every stage: they struggle to extract attributes from human conversations, disagree with humans about which attributes are relevant, and produce personalized responses that humans judge no better than generic ones. To address this, the authors propose two lightweight training-based interventions that bring automated evaluation closer to human data in the first two stages; in the final stage, however, learned reward models achieve only modest correlation with human ratings.
📝 Abstract
Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance when using synthetic versus human data. We collect human conversations (550) and judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919), and incorporating relevant attributes into a personalized response (1,101). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments on relevant attributes, and generate personalized responses that humans judge no better than generic responses (though LLM judges widely rate them as better). We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in our first two stages. However, in our third stage we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned personalization quality judgments are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.