Test-Time Alignment of LLMs via Sampling-Based Optimal Control in Pre-Logit Space

📅 2025-10-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Aligning large language models (LLMs) typically incurs high computational costs because it relies heavily on fine-tuning. Method: This paper proposes a gradient-free test-time alignment method that operates in the pre-logit space: it applies Gaussian perturbations to the hidden states preceding the output layer and employs importance sampling within a sampling-based model-predictive control framework to optimize these representations during decoding—maximizing expected reward without modifying model parameters. Contribution/Results: By integrating optimal control principles directly into the decoding process, the method enables efficient sequence re-ranking and generation refinement at test time. Experiments demonstrate that, under identical sample budgets, it outperforms best-of-n sampling and state-of-the-art reward-guided decoding strategies, achieving substantial gains in reward scores. These results support its alignment capability and generalization while maintaining low computational overhead.

📝 Abstract
Test-time alignment of large language models (LLMs) is attracting attention because fine-tuning LLMs incurs high computational costs. In this paper, we propose a new test-time alignment method called adaptive importance sampling on pre-logits (AISP), based on sampling-based model predictive control with stochastic control inputs. AISP applies Gaussian perturbations to pre-logits, the outputs of the penultimate layer, so as to maximize the expected reward with respect to the mean of the perturbation. We show that the optimal mean is obtained by importance sampling over sampled rewards. AISP outperforms best-of-n sampling in terms of reward as a function of the number of samples used, and achieves higher rewards than other reward-based test-time alignment methods.
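The mean update described in the abstract can be illustrated with a short sketch. This is not the authors' implementation; it assumes an MPPI-style exponentially weighted importance-sampling update (temperature `lam`), and `reward_fn`, the pre-logit dimension, and the sample count are illustrative placeholders.

```python
import numpy as np

def aisp_mean_update(pre_logit, reward_fn, n=64, sigma=0.1, lam=1.0, rng=None):
    """Sketch of one importance-sampling update of the perturbation mean.

    Draws n Gaussian perturbations of the pre-logit vector, scores each
    perturbed vector with reward_fn, and returns the pre-logit shifted by
    the reward-weighted mean perturbation (softmax weights, temperature lam).
    """
    rng = np.random.default_rng(rng)
    eps = rng.normal(0.0, sigma, size=(n, pre_logit.shape[0]))
    rewards = np.array([reward_fn(pre_logit + e) for e in eps])
    # Exponential (softmax) importance weights; subtract max for stability.
    w = np.exp((rewards - rewards.max()) / lam)
    w /= w.sum()
    return pre_logit + w @ eps  # weighted mean perturbation added to pre-logit
```

With a reward that penalizes distance to a target vector, the update moves the pre-logit toward higher-reward regions without any gradient of the reward, which is the gradient-free property the paper emphasizes.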
Problem

Research questions and friction points this paper is trying to address.

Fine-tuning LLMs for alignment incurs high computational costs
How to align LLMs at test time without modifying model parameters
How to maximize expected reward efficiently under a limited sample budget
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applies Gaussian perturbation in pre-logit space
Uses importance sampling with sampled rewards
Implements sampling-based model predictive control
Sekitoshi Kanai
NTT
Deep learning · Recurrent neural network · System identification
Tsukasa Yoshida
NTT, Inc., Toyohashi University of Technology
Hiroshi Takahashi
NTT
Machine Learning · Deep Learning
Haru Kuroki
The University of Osaka
Kazumune Hashimoto
The University of Osaka