🤖 AI Summary
Personalized text generation evaluation has long suffered from neglecting user individuality and reliance on manually crafted, gold-standard personalized references. To address this, we propose PREF, the first end-to-end evaluation framework that operates without gold personalized references. PREF decouples universal quality assessment—encompassing factual consistency and coherence—from user preference modeling, which leverages user profiles and contextual cues for re-ranking and enhancement. It further employs large language models (LLMs) to autonomously generate task-specific instructions and perform judge-style scoring, yielding interpretable, robust, and cross-model reusable scores. Evaluated on the PrefEval benchmark, PREF significantly outperforms existing baselines in accuracy and calibration, achieving strong agreement with human judgments and advancing the reliable evaluation of personalized generation systems.
📝 Abstract
Personalised text generation is essential for user-centric information systems, yet most evaluation methods overlook the individuality of users. We introduce **PREF**, a **P**ersonalised **R**eference-free **E**valuation **F**ramework that jointly measures general output quality and user-specific alignment without requiring gold personalised references. PREF operates in a three-step pipeline: (1) a coverage stage uses a large language model (LLM) to generate a comprehensive, query-specific guideline covering universal criteria such as factuality, coherence, and completeness; (2) a preference stage re-ranks and selectively augments these factors using the target user's profile, stated or inferred preferences, and context, producing a personalised evaluation rubric; and (3) a scoring stage applies an LLM judge to rate candidate answers against this rubric, ensuring baseline adequacy while capturing subjective priorities. This separation of coverage from preference improves robustness, transparency, and reusability, and allows smaller models to approximate the personalised quality of larger ones. Experiments on the PrefEval benchmark, including implicit preference-following tasks, show that PREF achieves higher accuracy, better calibration, and closer alignment with human judgments than strong baselines. By enabling scalable, interpretable, and user-aligned evaluation, PREF lays the groundwork for more reliable assessment and development of personalised language generation systems.
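The three-stage pipeline described in the abstract can be sketched in code. This is a minimal illustration under assumptions: the function names, prompts, and the stubbed `call_llm` are hypothetical stand-ins, not the authors' actual implementation, prompts, or scoring scale.

```python
def call_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call; a real system would query an API."""
    # Canned responses keyed on the prompt, purely for illustration.
    if "Re-rank" in prompt:
        return "1. coherence\n2. factuality\n3. completeness\n4. concise style"
    if "guideline" in prompt:
        return "1. factuality\n2. coherence\n3. completeness"
    return "4"  # the judge returns a score, here on an assumed 1-5 scale


def coverage_stage(query: str) -> str:
    # Stage 1: derive a query-specific guideline of universal criteria
    # (factuality, coherence, completeness) without any user information.
    return call_llm(f"Write an evaluation guideline for the query: {query}")


def preference_stage(guideline: str, user_profile: str) -> str:
    # Stage 2: re-rank and selectively augment the guideline using the
    # target user's profile, yielding a personalised rubric.
    return call_llm(
        f"Re-rank and extend these criteria for the user '{user_profile}':\n{guideline}"
    )


def scoring_stage(rubric: str, answer: str) -> int:
    # Stage 3: an LLM judge rates the candidate answer against the rubric.
    return int(call_llm(f"Score this answer against the rubric:\n{rubric}\n{answer}"))


def pref_score(query: str, user_profile: str, answer: str) -> int:
    guideline = coverage_stage(query)          # universal coverage
    rubric = preference_stage(guideline, user_profile)  # user alignment
    return scoring_stage(rubric, answer)       # judge-style scoring
```

Keeping the coverage stage separate from the preference stage means the generic guideline can be cached and reused across users, which is one reading of the reusability claim in the abstract.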