Transformer Neural Processes - Kernel Regression

📅 2024-11-19
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Neural Processes (NPs) suffer from an O(n²) computational bottleneck due to self-attention, limiting scalability for large-scale stochastic process modeling. To address this, we propose a scalable NP framework grounded in kernel-based inductive biases. Our method introduces: (1) a Kernel Regression Block (KRBlock), a parameter-efficient transformer block that reduces complexity to O(n_c² + n_c·n_t) for n_c context and n_t test points; (2) a kernel-based attention bias; and (3) two novel attention mechanisms: Scan Attention (SA), a memory-efficient scan-based attention that, paired with the kernel bias, makes the model translation invariant, and Deep Kernel Attention (DKA), a Performer-style attention that implicitly incorporates a distance bias and further reduces complexity to O(n_c). On a single 24GB GPU, the model performs posterior inference with 100K context points on over 1M test points in under a minute. Across meta-regression, Bayesian optimization, image completion, and epidemiological forecasting, the DKA variant outperforms its Performer counterpart on nearly every benchmark, while the SA variant achieves state-of-the-art results.
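The kernel-based attention bias can be pictured with a small sketch (a generic illustration assuming an RBF form, not the paper's exact bias): the log of an RBF kernel over input locations is added to the attention logits, and because it depends only on differences of locations, shifting all inputs leaves it unchanged (translation invariance).

```python
import numpy as np

def rbf_log_bias(x_q, x_k, lengthscale=1.0):
    # Log of an RBF kernel over input locations, to be added to
    # attention logits as a distance-based bias. It depends only on
    # x_q - x_k, so shifting every input by the same constant leaves
    # it unchanged (translation invariance).
    d2 = ((x_q[:, None, :] - x_k[None, :, :]) ** 2).sum(-1)
    return -0.5 * d2 / lengthscale**2

x = np.linspace(0.0, 1.0, 5)[:, None]       # 5 one-dimensional locations
b1 = rbf_log_bias(x, x)
b2 = rbf_log_bias(x + 10.0, x + 10.0)       # same locations, shifted
print(np.allclose(b1, b2))                  # True: bias is shift-invariant
```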

📝 Abstract
Neural Processes (NPs) are a rapidly evolving class of models designed to directly model the posterior predictive distribution of stochastic processes. Originally developed as a scalable alternative to Gaussian Processes (GPs), which are limited by $\mathcal{O}(n^3)$ runtime complexity, the most accurate modern NPs can often rival GPs but still suffer from an $\mathcal{O}(n^2)$ bottleneck due to their attention mechanism. We introduce the Transformer Neural Process - Kernel Regression (TNP-KR), a scalable NP featuring: (1) a Kernel Regression Block (KRBlock), a simple, extensible, and parameter-efficient transformer block with complexity $\mathcal{O}(n_c^2 + n_c n_t)$, where $n_c$ and $n_t$ are the number of context and test points, respectively; (2) a kernel-based attention bias; and (3) two novel attention mechanisms: scan attention (SA), a memory-efficient scan-based attention that, when paired with a kernel-based bias, can make TNP-KR translation invariant, and deep kernel attention (DKA), a Performer-style attention that implicitly incorporates a distance bias and further reduces complexity to $\mathcal{O}(n_c)$. These enhancements enable both TNP-KR variants to perform inference with 100K context points on over 1M test points in under a minute on a single 24GB GPU. On benchmarks spanning meta-regression, Bayesian optimization, image completion, and epidemiology, TNP-KR with DKA outperforms its Performer counterpart on nearly every benchmark, while TNP-KR with SA achieves state-of-the-art results.
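The O(n_c² + n_c·n_t) complexity of the KRBlock can be illustrated with a minimal attention sketch (a toy illustration of the attention pattern only, not the paper's implementation): context points attend to one another, while test points attend only to context points, so no n_t² term ever appears in the score matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def krblock_attention(ctx, tst):
    """Toy sketch of the KRBlock attention pattern.

    Context points attend to all context points (n_c x n_c scores);
    test points attend only to context points (n_t x n_c scores).
    Total score entries: n_c^2 + n_c * n_t, with no n_t^2 term.
    """
    d = ctx.shape[-1]
    ctx_scores = ctx @ ctx.T / np.sqrt(d)     # (n_c, n_c)
    tst_scores = tst @ ctx.T / np.sqrt(d)     # (n_t, n_c)
    ctx_out = softmax(ctx_scores) @ ctx
    tst_out = softmax(tst_scores) @ ctx
    return ctx_out, tst_out, ctx_scores.size + tst_scores.size

rng = np.random.default_rng(0)
n_c, n_t, d = 8, 32, 4
ctx = rng.normal(size=(n_c, d))
tst = rng.normal(size=(n_t, d))
ctx_out, tst_out, n_scores = krblock_attention(ctx, tst)
print(n_scores)  # 8*8 + 8*32 = 320
```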
Problem

Research questions and friction points this paper is trying to address.

Scalable modeling of the posterior predictive distribution of stochastic processes
Reducing the O(n²) attention bottleneck in Neural Processes
Making attention mechanisms efficient enough for large-scale inference
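For context on the kernel structure involved, classical kernel regression (the estimator the Kernel Regression Block takes its name from) predicts at a test point as a kernel-weighted average of context targets. A minimal Nadaraya-Watson sketch, the standard textbook estimator rather than the paper's block:

```python
import numpy as np

def nadaraya_watson(x_ctx, y_ctx, x_tst, lengthscale=0.05):
    # Prediction at each test point is a kernel-weighted average of
    # context targets: y(x*) = sum_i k(x*, x_i) y_i / sum_i k(x*, x_i)
    d2 = (x_tst[:, None] - x_ctx[None, :]) ** 2
    k = np.exp(-0.5 * d2 / lengthscale**2)    # (n_t, n_c) RBF weights
    return (k @ y_ctx) / k.sum(axis=1)

x_ctx = np.linspace(0.0, 1.0, 50)
y_ctx = np.sin(2 * np.pi * x_ctx)
x_tst = np.array([0.25, 0.75])
pred = nadaraya_watson(x_ctx, y_ctx, x_tst)
print(pred)  # approximately [1, -1], i.e. sin(2*pi*x) at the test points
```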
Innovation

Methods, ideas, or system contributions that make the work stand out.

Kernel Regression Block (KRBlock) reduces complexity to O(n_c² + n_c·n_t)
Scan Attention (SA) provides memory-efficient, translation-invariant attention
Deep Kernel Attention (DKA) further reduces complexity to O(n_c)
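DKA is described as Performer-style. The general Performer trick, sketched here generically with FAVOR+ positive random features (an illustration of the underlying technique, not the paper's DKA), replaces softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV), so the n × n score matrix is never formed and cost becomes linear in n:

```python
import numpy as np

def favor_features(x, w):
    # Positive random features approximating the softmax kernel
    # (Performer's FAVOR+): phi(x) = exp(w x - ||x||^2 / 2) / sqrt(m)
    m = w.shape[0]
    norm = 0.5 * (x ** 2).sum(-1, keepdims=True)
    return np.exp(x @ w.T - norm) / np.sqrt(m)

def linear_attention(q, k, v, w):
    # O(n) attention: phi(Q) (phi(K)^T V), never forming the n x n matrix
    qf, kf = favor_features(q, w), favor_features(k, w)
    kv = kf.T @ v                    # (m, d_v), independent of n pairs
    z = qf @ kf.sum(0)               # (n,) per-query normalizer
    return (qf @ kv) / z[:, None]

def softmax_attention(q, k, v):
    # Exact O(n^2) reference: full n x n score matrix
    s = q @ k.T
    a = np.exp(s - s.max(-1, keepdims=True))
    a /= a.sum(-1, keepdims=True)
    return a @ v

rng = np.random.default_rng(0)
n, d, m = 64, 8, 256
# Pre-scale queries/keys so the implicit temperature is 1/sqrt(d)
q = rng.normal(size=(n, d)) / d**0.25
k = rng.normal(size=(n, d)) / d**0.25
v = rng.normal(size=(n, d))
w = rng.normal(size=(m, d))          # random projection for FAVOR+

approx = linear_attention(q, k, v, w)
exact = softmax_attention(q, k, v)
print(np.abs(approx - exact).mean())  # small approximation error
```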