PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model

📅 2025-05-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multi-objective test-time alignment, existing approaches such as GenARM rely on multiple independently trained autoregressive reward models (ARMs), one per preference dimension, which inflates inference cost and misaligns guided generation with user preferences because the ARMs are trained without awareness of each other. To address this, we propose PARM, a single unified ARM built on: (1) Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA), which conditions one ARM on the full multi-dimensional user preference vector; and (2) a weak-to-strong test-time guidance paradigm, in which a compact PARM steers a frozen, large-scale LLM during generation. Experiments demonstrate that our approach significantly reduces inference cost, aligns generation more closely with preference vectors, and enables flexible, real-time trade-off control over strong LLMs under limited computational resources.

📝 Abstract
Multi-objective test-time alignment aims to adapt large language models (LLMs) to diverse multi-dimensional user preferences during inference while keeping LLMs frozen. Recently, GenARM (Xu et al., 2025) first independently trains Autoregressive Reward Models (ARMs) for each preference dimension without awareness of each other, then combines their outputs based on user-specific preference vectors during inference to achieve multi-objective test-time alignment, leading to two key limitations: the need for multiple ARMs increases the inference cost, and the separate training of ARMs causes the misalignment between the guided generation and the user preferences. To address these issues, we propose Preference-aware ARM (PARM), a single unified ARM trained across all preference dimensions. PARM uses our proposed Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA), which employs a bilinear form to condition the ARM on preference vectors, enabling it to achieve precise control over preference trade-offs during inference. Experiments demonstrate that PARM reduces inference costs and achieves better alignment with preference vectors compared with existing methods. Additionally, PARM enables weak-to-strong guidance, allowing a smaller PARM to guide a larger frozen LLM without expensive training, making multi-objective alignment accessible with limited computing resources. The code is available at https://github.com/Baijiong-Lin/PARM.
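The bilinear conditioning in PBLoRA can be sketched as follows. This is a minimal illustrative implementation, not the paper's exact parameterization: it assumes the small inner matrix W(λ) is a linear combination of per-dimension r×r blocks weighted by the preference vector, inserted between the usual LoRA factors.

```python
import numpy as np

def pblora_delta(A, B, W_blocks, prefs):
    """Preference-conditioned bilinear low-rank weight update (sketch).

    Standard LoRA uses a fixed low-rank product B @ A. PBLoRA instead
    inserts a small r x r matrix W(lambda) between the factors that
    depends on the preference vector, so one set of adapter weights
    serves every preference trade-off.

    A        : (r, d_in) down-projection factor
    B        : (d_out, r) up-projection factor
    W_blocks : list of (r, r) blocks, one per preference dimension
               (an illustrative assumption about how W(lambda) is built)
    prefs    : preference vector lambda, one weight per dimension
    """
    # W(lambda): combine the per-dimension blocks with preference weights.
    W_lam = sum(l * Wb for l, Wb in zip(prefs, W_blocks))  # (r, r)
    # Bilinear low-rank update applied to the frozen base weight.
    return B @ W_lam @ A  # (d_out, d_in)
```

Because the preference vector enters only through the tiny r×r matrix, sweeping λ at inference time changes the effective adapter continuously without retraining or swapping models.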
Problem

Research questions and friction points this paper is trying to address.

Multiple independent reward models inflate inference cost
Separately trained ARMs misalign guided generation with user preferences
Aligning large frozen LLMs is expensive under limited computing resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single unified ARM for multi-preference alignment
Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA)
Weak-to-strong guidance with smaller PARM
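The weak-to-strong guidance idea can be sketched at the logit level. This is a hedged simplification of GenARM-style decoding: the frozen large LLM's next-token logits are shifted by the small preference-aware ARM's per-token reward logits, with an assumed guidance-strength knob `beta` that is not named in the source.

```python
import numpy as np

def guided_next_token_logits(base_logits, arm_logits, beta=1.0):
    """Weak-to-strong test-time guidance at the logit level (sketch).

    The frozen large LLM supplies base_logits; a small preference-aware
    ARM supplies per-token reward logits. Adding them samples, up to
    normalization, from p_base(y | x) * exp(beta * r(y | x)), steering
    the strong model without touching its weights.
    """
    return base_logits + beta * arm_logits

def sample_greedy(base_logits, arm_logits, beta=1.0):
    """Greedy decoding step under the guided distribution."""
    return int(np.argmax(guided_next_token_logits(base_logits, arm_logits, beta)))
```

With a single preference-conditioned ARM, the same reward head serves every preference vector, so only one small model runs alongside the frozen LLM at each decoding step.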