🤖 AI Summary
To address the limited performance of Prior-data Fitted Networks (PFNs) on high-dimensional physics equation regression, this paper proposes the Decoupled-Value Attention (DVA) mechanism, which introduces, for the first time, the independence principle from Gaussian process (GP) kernel functions into attention modeling. DVA explicitly decouples input-similarity computation from label propagation, revealing that attention design, not network architecture, is the dominant factor governing PFN performance. It thereby enables efficient, GP-style inference without explicit kernel specification. DVA is architecture-agnostic and integrates seamlessly into both Transformer and CNN backbones. Empirically, it reduces prediction loss by over 50% on 5D and 10D regression tasks, achieves a mean absolute error of ~1×10⁻³ on a 64D power flow equation approximation, and accelerates inference by more than 80× relative to exact GP inference.
📝 Abstract
Prior-data fitted networks (PFNs) are a promising alternative to time-consuming Gaussian process (GP) inference for creating fast surrogates of physical systems. PFNs reduce the computational burden of GP training by replacing Bayesian inference with a single forward pass of a learned prediction model. However, with standard Transformer attention, PFNs show limited effectiveness on high-dimensional regression tasks. We introduce Decoupled-Value Attention (DVA), motivated by the GP property that the function space is fully characterized by the kernel over inputs and that the predictive mean is a weighted sum of training targets. DVA computes similarities from inputs only and propagates labels solely through the values; it thus mirrors the GP update while remaining kernel-free. We demonstrate that the crucial factor in scaling PFNs is the attention rule rather than the architecture itself. Specifically, our results show that (a) localized attention consistently reduces out-of-sample validation loss in PFNs across different dimensional settings, by more than 50% in the five- and ten-dimensional cases, and (b) the attention rule is more decisive than the choice of backbone architecture, with CNN-based PFNs performing on par with their Transformer-based counterparts. The proposed PFNs approximate 64-dimensional power flow equations with a mean absolute error on the order of 1e-3, while being over 80× faster than exact GP inference.
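To make the mechanism concrete, here is a minimal NumPy sketch of a DVA-style update as described above: similarities are computed from inputs only (playing the role of a GP kernel), and labels enter only through the values, so the prediction is a weighted sum of training targets. The function name, the projection matrices `w_q`/`w_k`, and the toy data are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def decoupled_value_attention(x_train, y_train, x_query, w_q, w_k, scale):
    """Sketch of a DVA-style update (assumed form, not the paper's code):
    queries/keys depend on inputs only; values carry the labels, mirroring
    the GP predictive mean as a weighted sum of training targets."""
    q = x_query @ w_q                     # queries from query inputs only
    k = x_train @ w_k                     # keys from training inputs only
    attn = softmax(q @ k.T / scale)       # kernel-like input similarity
    return attn @ y_train                 # weighted sum of training labels

# Toy usage: 1D regression on y = sin(x).
rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=(32, 1))
y_train = np.sin(x_train[:, 0])
x_query = np.array([[0.0], [1.5]])
d = 1
w_q, w_k = np.eye(d), np.eye(d)           # hypothetical learned projections
pred = decoupled_value_attention(x_train, y_train, x_query, w_q, w_k, np.sqrt(d))
print(pred.shape)                         # one prediction per query point
```

Because each attention row is a convex combination, every prediction lies within the range of the training targets, which is one way the label pathway stays "decoupled" from the similarity computation.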