Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss

📅 2026-06-04

🤖 AI Summary
This work addresses the gap between training-time and test-time performance that arises when a network trained with a one-step prediction loss is deployed by rolling out along its own predictions. The authors call this phenomenon test-time feedback (TTF); the resulting mismatch grows with task length as errors accumulate. To mitigate TTF, they propose Double Preconditioning (DoPr), an optimization paradigm that treats optimizer design as a design axis alongside data curation, architecture, and objective design. DoPr combines gradient-wise preconditioning (as in Adam and Muon) with activation-wise preconditioning (as in KFAC). Across TTF-prone settings, including autoregressive language modeling, flow-based generative modeling, and robot policy learning, adding activation-wise preconditioning improves downstream metrics such as task success rate and generation quality. These gains do not consistently coincide with improvements in validation loss, which raises questions about how to evaluate models trained with one-step supervised objectives.
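The error accumulation described above can be illustrated with a toy example. The sketch below is not from the paper: it assumes a hypothetical 2D linear system (a pure rotation) and a hypothetical learned model whose only flaw is a 2% gain error. It shows how a small one-step error can grow when the model is fed its own predictions.

```python
import numpy as np

def rollout_errors(A_true, A_model, x0, horizon):
    """Per-step gap between the true trajectory and the model's self-fed rollout."""
    x_true = x0.copy()
    x_model = x0.copy()
    errors = []
    for _ in range(horizon):
        x_true = A_true @ x_true
        x_model = A_model @ x_model  # the model consumes its own previous prediction
        errors.append(float(np.linalg.norm(x_model - x_true)))
    return errors

# Hypothetical ground-truth dynamics: a norm-preserving rotation.
theta = 0.1
A_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])

# Hypothetical learned model: correct direction, 2% gain error.
A_model = 1.02 * A_true

x0 = np.array([1.0, 0.0])
errors = rollout_errors(A_true, A_model, x0, horizon=50)
one_step_error = errors[0]   # the kind of error a one-step loss measures
final_error = errors[-1]     # the error after 50 steps of self-fed rollout
```

In this toy setting the error after `t` steps is `1.02**t - 1`, so a one-step error of 0.02 becomes roughly 1.69 after 50 steps. A one-step validation loss would only ever see the first number.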
📝 Abstract
Many modern applications of deep learning involve training a neural network via a one-step prediction loss (e.g., $L^2$ regression, cross-entropy), but deploy the network by rolling out along its own predictions. Key examples include autoregressive language modeling, flow-based generative modeling, and robot policy learning. It is well-documented that these settings induce a phenomenon we call test-time feedback (TTF): the mismatch between the training/validation loss and downstream metrics of interest, such as task success rate and generation quality, which grows with task length. While data curation, architecture, and objective design have been proposed to combat train-test shift in TTF settings, this paper proposes optimization as a new design axis to mitigate error accumulation. Specifically, we introduce a new optimization paradigm called double-preconditioning (DoPr) uniquely tailored to the challenges of TTF. DoPr combines gradient-wise preconditioning, as in Adam and Muon, with activation-wise preconditioning (AP), such as in KFAC. We show that the addition of AP yields a drop-in intervention for increasing downstream model performance across a range of TTF settings. Interestingly, these gains in test-time performance do not consistently accompany improvements in validation loss, opening new questions about how to properly evaluate models trained with one-step supervised objectives.
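To make the combination of the two preconditioners concrete, here is a minimal NumPy sketch for a single linear layer. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not specify the update rule, so the order of operations, the damping term, and the helper names `activation_precondition` and `adam_direction` are all hypothetical. Activation-wise preconditioning is modeled as a KFAC-style multiplication by the damped inverse input covariance, and gradient-wise preconditioning as a standard Adam step.

```python
import numpy as np

def activation_precondition(grad, inputs, damping=1e-3):
    """Right-multiply a linear-layer gradient by the damped inverse input covariance.

    grad has shape (d_out, d_in); inputs has shape (n, d_in).
    Assumption: this KFAC-style input factor stands in for "activation-wise" preconditioning.
    """
    n, d_in = inputs.shape
    cov = inputs.T @ inputs / n
    # Solve instead of inverting: equals grad @ inv(cov + damping * I) since cov is symmetric.
    return np.linalg.solve(cov + damping * np.eye(d_in), grad.T).T

def adam_direction(grad, state, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam update direction with bias correction (gradient-wise preconditioning)."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return m_hat / (np.sqrt(v_hat) + eps)

# Hypothetical noiseless regression problem with badly scaled input features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
W_star = np.array([[1.0, -2.0, 0.5],
                   [0.3, 0.8, -1.5]])
W = np.zeros_like(W_star)

# Gradient of the mean squared error 0.5 * mean ||W x - W_star x||^2 with respect to W.
residual = X @ (W - W_star).T
grad = residual.T @ X / X.shape[0]

# Step 1 (assumed order): activation-wise preconditioning.
ap_grad = activation_precondition(grad, X, damping=0.0)

# Step 2: gradient-wise preconditioning applied to the result.
state = {"t": 0, "m": np.zeros_like(W), "v": np.zeros_like(W)}
direction = adam_direction(ap_grad, state)
W_next = W - 0.1 * direction
```

With zero damping and noiseless targets, the activation-preconditioned gradient equals `W - W_star` exactly, independent of how the input features are scaled. That is the intuition for why an input-covariance factor can help with poorly conditioned activations. Muon-style orthogonalization could replace the Adam step, but that variant is not sketched here.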
Problem

Research questions and friction points this paper is trying to address.

test-time feedback
train-test shift
error accumulation
downstream performance
one-step prediction loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Double Preconditioning
Test-Time Feedback
Activation-wise Preconditioning
Optimization for Deployment
Train-Test Mismatch