🤖 AI Summary
Human feedback for large language model (LLM) alignment is costly and scales poorly. Method: This paper proposes a novel paradigm that replaces human annotators with lightweight, weaker LLMs to provide high-quality alignment feedback. Through systematic empirical studies, we demonstrate, for the first time, that model scale has limited impact on feedback quality: preference labels generated by weaker LLMs match or even surpass human annotations in instruction tuning and reward modeling tasks. We introduce a multidimensional alignment evaluation framework, incorporating quantitative metrics, qualitative analysis, and human comparisons, to rigorously assess feedback consistency, robustness, and generalization. Contribution/Results: Our approach substantially reduces both human labor and computational overhead, enabling sustainable, large-scale LLM alignment. It establishes a theoretical and practical foundation for deploying weak LLMs as low-cost, high-fidelity “alignment teachers.”
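To make the paradigm concrete, here is a minimal sketch of weak-LLM preference labeling, assuming a small open instruction-tuned model served through Hugging Face transformers. The model name (`Qwen/Qwen2.5-0.5B-Instruct`), prompt wording, and answer-parsing rule are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: a small ("weak") LLM stands in for the human annotator and
# emits a preference label over two candidate responses. Model choice,
# prompt, and parsing are assumptions; any comparable lightweight
# instruction-tuned model would do.
from transformers import pipeline

judge = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

PROMPT = (
    "You are comparing two responses to the same instruction.\n"
    "Instruction: {instruction}\n\n"
    "Response A: {a}\n\nResponse B: {b}\n\n"
    "Which response is more helpful, honest, and harmless? "
    "Answer with a single letter, A or B.\nAnswer:"
)

def preference_label(instruction: str, resp_a: str, resp_b: str) -> str:
    """Ask the weak LLM which response it prefers; returns 'A' or 'B'."""
    prompt = PROMPT.format(instruction=instruction, a=resp_a, b=resp_b)
    completion = judge(
        prompt, max_new_tokens=4, do_sample=False, return_full_text=False
    )[0]["generated_text"]
    answer = completion.strip().upper()
    # Default to 'A' if the answer is unparseable. A robust pipeline would
    # also query with A/B swapped to control for position bias.
    return "B" if answer.startswith("B") else "A"

# The resulting (instruction, chosen, rejected) triples can then stand in
# for human preference data in instruction tuning or reward-model training.
```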
📝 Abstract
The burgeoning capabilities of large language models (LLMs) have underscored the need for alignment to ensure these models act in accordance with human values and intentions. Existing alignment frameworks impose constraints in the form of either expensive human effort or high computational cost. This paper explores a promising middle ground, where we employ a weak LLM that is significantly less resource-intensive than top-tier models yet offers more automation than purely human feedback. We present a systematic study to evaluate and understand a weak LLM's ability to generate feedback for alignment. Our empirical findings demonstrate that weak LLMs can provide feedback that rivals, or even exceeds, that of fully human-annotated data. Our study indicates that model size has a minimal impact on feedback efficacy, shedding light on a scalable and sustainable alignment strategy. To deepen our understanding of alignment under weak LLM feedback, we conduct a series of qualitative and quantitative analyses, offering novel insights into the quality discrepancies between human feedback and weak LLM feedback.
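For context on how such preference feedback is typically consumed downstream, a reward model can be trained on the weak-LLM-labeled pairs with the standard Bradley-Terry objective common in RLHF pipelines. This is the conventional formulation, shown for illustration; the notation for the weak-LLM-labeled dataset is ours, and the abstract does not specify the exact loss used.

```latex
% Standard Bradley-Terry reward-modeling loss, with (x, y_w, y_l) triples
% drawn from the weak-LLM-labeled preference dataset D_weak
% (y_w = chosen, y_l = rejected, as judged by the weak LLM, not a human):
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim \mathcal{D}_{\text{weak}}}
    \Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]
```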