🤖 AI Summary
This work addresses the high cost and poor interpretability of manually designed reward rules in Reinforcement Learning from Human Feedback (RLHF). We propose a fully automated framework for rule extraction and reward construction: (1) a reasoning model interprets human preference feedback; (2) candidate rules are extracted from its chain-of-thought reasoning; and (3) these are synthesized into a unified, verifiable rule set. Crucially, we introduce the rule satisfaction rate as a verifiable auxiliary reward. Our method integrates language-model-based rule verifiers, multi-objective reward fusion, and policy fine-tuning of Llama-3-8B. On AlpacaEval 2.0, our approach improves length-controlled win rate by a relative 28.6% over a GRPO baseline; on a held-out MT-Bench subset, it raises second-turn performance by a relative 6.1%. It also mitigates reward hacking while maintaining high consistency between the automatically extracted rules and human preferences.
📝 Abstract
Rule-based rewards offer a promising strategy for improving reinforcement learning from human feedback (RLHF), but current approaches often rely on manual rule engineering. We present AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards. AutoRule extraction operates in three stages: it leverages a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chains of these interpretations, and synthesizes them into a unified rule set. Leveraging the finalized rule set, we employ language-model verifiers to compute the fraction of rules satisfied by each output, using this metric as an auxiliary reward alongside the learned reward model during policy optimization. Training a Llama-3-8B model with AutoRule results in a 28.6% relative improvement in length-controlled win rate on AlpacaEval 2.0, and a 6.1% relative gain in second-turn performance on a held-out MT-Bench subset, compared to a GRPO baseline trained with the same learned reward model but without the rule-based auxiliary reward. Our analysis confirms that the extracted rules exhibit good agreement with dataset preferences. We find that AutoRule demonstrates reduced reward hacking compared to a learned reward model when trained for two episodes. Finally, our case study suggests that the extracted rules capture unique qualities valued in different datasets. The extracted rules are provided in the appendix, and the code is open-sourced at https://github.com/cxcscmu/AutoRule.
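The auxiliary reward described above — the fraction of rules an output satisfies, combined with the learned reward-model score — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the lambda-based rule checks and the `alpha` weight are hypothetical stand-ins for AutoRule's language-model verifiers and its actual reward-fusion scheme.

```python
from typing import Callable, List

# A rule verifier maps an output string to True/False (satisfied or not).
# In AutoRule these checks are performed by language-model verifiers;
# the simple predicates below are illustrative placeholders.
RuleVerifier = Callable[[str], bool]

def rule_satisfaction_rate(output: str, rules: List[RuleVerifier]) -> float:
    """Fraction of rules the output satisfies (the auxiliary reward)."""
    if not rules:
        return 0.0
    return sum(rule(output) for rule in rules) / len(rules)

def combined_reward(output: str,
                    learned_reward: float,
                    rules: List[RuleVerifier],
                    alpha: float = 0.5) -> float:
    """Blend the learned reward-model score with the rule-based reward.

    `alpha` is an illustrative fusion weight, not a value from the paper.
    """
    return learned_reward + alpha * rule_satisfaction_rate(output, rules)

# Toy rules standing in for LM-verified rules extracted from preferences:
rules: List[RuleVerifier] = [
    lambda out: len(out.split()) <= 200,                  # stays concise
    lambda out: not out.lower().startswith("as an ai"),   # no boilerplate opener
]

print(combined_reward("The capital of France is Paris.", 1.2, rules))
```

During policy optimization (GRPO in the paper), this combined scalar would replace the raw reward-model score for each sampled completion; because the rule check is a verifiable pass/fail signal rather than a learned score, it is harder for the policy to exploit.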