Aligning Large Language Models with Implicit Preferences from User-Generated Content

📅 2025-06-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing preference alignment methods rely heavily on costly human annotations or distillation from stronger models, limiting scalability and practical deployment. To address this, the authors propose mining massive unlabeled user-generated content (UGC) as an implicit source of human preferences. Their framework, Preference from UGC (PUGC), transforms UGC into user queries, samples candidate responses from the policy model, and scores those responses using the original UGC as a reference text, yielding preference pairs for Direct Preference Optimization (DPO). On AlpacaEval 2, PUGC delivers a 9.37% improvement over training on conventional preference data and reaches a state-of-the-art 35.93% length-controlled win rate with Mistral-7B-Instruct. Further analyses show gains in reward quality, domain-specific alignment, robustness to UGC quality, and theory-of-mind capabilities.

๐Ÿ“ Abstract
Learning from preference feedback is essential for aligning large language models (LLMs) with human values and improving the quality of generated responses. However, existing preference learning methods rely heavily on curated data from humans or advanced LLMs, which is costly and difficult to scale. In this work, we present PUGC, a novel framework that leverages implicit human Preferences in unlabeled User-Generated Content (UGC) to generate preference data. Although UGC is not explicitly created to guide LLMs in generating human-preferred responses, it often reflects valuable insights and implicit preferences from its creators that have the potential to address readers' questions. PUGC transforms UGC into user queries and generates responses from the policy model. The UGC is then leveraged as a reference text for response scoring, aligning the model with these implicit preferences. This approach improves the quality of preference data while enabling scalable, domain-specific alignment. Experimental results on AlpacaEval 2 show that models trained with DPO and PUGC achieve a 9.37% performance improvement over traditional methods, setting a state-of-the-art 35.93% length-controlled win rate using Mistral-7B-Instruct. Further studies highlight gains in reward quality, domain-specific alignment effectiveness, robustness to UGC quality, and theory-of-mind capabilities. Our code and dataset are available at https://zhaoxuan.info/PUGC.github.io/
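The pipeline the abstract describes — turn a piece of UGC into the query it implicitly answers, sample candidate responses from the policy model, and score them against the UGC as a reference text — can be sketched roughly as below. All function names are illustrative assumptions, and the token-overlap scorer is a trivial stand-in: the paper uses LLMs for the query-generation and reference-conditioned scoring steps.

```python
# Hypothetical sketch of the PUGC data-generation pipeline; names and the
# scoring rubric are illustrative assumptions, not the authors' implementation.

def ugc_to_query(ugc: str) -> str:
    """Turn a piece of UGC into the user question it implicitly answers.
    (The paper uses an LLM for this step; here, a trivial stand-in.)"""
    first_sentence = ugc.split(".")[0]
    return f"What should I know about: {first_sentence}?"

def sample_responses(policy_model, query: str, n: int = 4) -> list[str]:
    """Draw n candidate responses from the policy model."""
    return [policy_model(query) for _ in range(n)]

def score_with_reference(response: str, ugc: str) -> float:
    """Score a response using the original UGC as a reference text.
    Stand-in: token overlap with the reference (the paper conditions an
    LLM judge on the UGC instead)."""
    ref, hyp = set(ugc.lower().split()), set(response.lower().split())
    return len(ref & hyp) / max(len(ref), 1)

def build_preference_pair(policy_model, ugc: str):
    """Produce one (query, chosen, rejected) triple for DPO training."""
    query = ugc_to_query(ugc)
    responses = sample_responses(policy_model, query)
    ranked = sorted(responses, key=lambda r: score_with_reference(r, ugc))
    return query, ranked[-1], ranked[0]  # best- vs. worst-scored response
```

The resulting (query, chosen, rejected) triples can then be fed directly into standard DPO training, which is what lets PUGC sidestep explicit human preference annotation.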
Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs with implicit human preferences from unlabeled user content
Reducing reliance on costly curated preference data for LLM training
Improving response quality via scalable domain-specific alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages implicit preferences in user-generated content
Transforms UGC into queries and reference responses
Improves preference data quality and scalability
Authors

Zhaoxuan Tan (University of Notre Dame)
Zheng Li (Amazon.com Inc)
Tianyi Liu (Amazon.com Inc)
Haodong Wang (Amazon.com Inc)
Hyokun Yun (Amazon)
Ming Zeng (Amazon.com Inc)
Pei Chen (Amazon.com Inc)
Zhihan Zhang (University of Notre Dame)
Yifan Gao (Amazon.com Inc)
Ruijie Wang (Amazon.com Inc)
Priyanka Nigam (Amazon.com Inc)
Bing Yin (Amazon.com)
Meng Jiang (University of Notre Dame, Amazon.com Inc)