CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching

📅 2026-06-09
🤖 AI Summary
This work addresses the computational inefficiency of prior-fitted networks (PFNs) on large training sets, which stems from the quadratic complexity of self-attention during inference. The authors propose CRUMB, a three-stage inference framework that first clusters the test queries, then greedily selects a training subset for each cluster by minimizing the maximum mean discrepancy (MMD) between the subset and the cluster, and finally performs exact PFN inference on the reduced context. CRUMB requires no retraining and is architecture-agnostic, and because the context is matched to each test batch, it is resilient to covariate shift. Evaluated on the 51-dataset TabArena benchmark, CRUMB outperforms existing context selection methods across three PFN architectures, achieving better trade-offs between inference efficiency and predictive performance.
📝 Abstract
Prior-fitted networks (PFNs) are a promising class of tabular foundation models that perform in-context learning: the entire labelled training set is supplied as context, and predictions for the test queries are produced in a single forward pass. However, the quadratic scaling of self-attention in many PFN architectures makes inference prohibitively expensive for very large training datasets. We propose CRUMB (Clustered Retrieval Using Minimised-MMD Batching), a three-stage inference wrapper that (i) clusters the test queries, (ii) selects a small, distributionally matched training subset for each cluster by greedily minimising the maximum mean discrepancy (MMD), and (iii) runs exact PFN inference on each reduced-context batch. CRUMB is architecture-agnostic and requires no retraining. On the 51-dataset TabArena benchmark, evaluated across three PFN architectures (TabPFNv2, TabICLv1, TabICLv2), we show that CRUMB outperforms comparable state-of-the-art context selection strategies. We also show that CRUMB is resilient to covariate drift, as the MMD-minimisation step naturally aligns the training context distribution with that of the current test batch.
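The three stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: a Gaussian (RBF) kernel with a fixed `gamma`, the biased estimate of squared MMD, a plain Lloyd's k-means for the clustering stage, and a hypothetical `predict_fn(ctx_X, ctx_y, query_X)` callable standing in for the PFN forward pass. All function names are invented here.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2(X, Y, gamma):
    """Biased (V-statistic) estimate of squared MMD between samples X and Y."""
    return (rbf_kernel(X, X, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean())


def greedy_mmd_select(train_X, query_X, k, gamma=1.0):
    """Greedily pick k training rows whose empirical distribution is close
    (in squared MMD) to query_X. Returns the chosen indices in pick order."""
    n, m = len(train_X), len(query_X)
    K_tt = rbf_kernel(train_X, train_X, gamma)  # O(n^2) memory: sketch only
    diag = np.diag(K_tt).copy()
    cross = rbf_kernel(train_X, query_X, gamma).sum(axis=1)
    to_selected = np.zeros(n)  # per candidate: sum of kernel values to chosen rows
    W, C = 0.0, 0.0            # running within-subset and subset-to-query sums
    chosen = np.zeros(n, dtype=bool)
    selected = []
    for step in range(1, min(k, n) + 1):
        # Squared MMD after adding each candidate; the query-only term is
        # constant across candidates and therefore omitted.
        cand_W = W + 2.0 * to_selected + diag
        cand_C = C + cross
        score = cand_W / step ** 2 - 2.0 * cand_C / (step * m)
        score[chosen] = np.inf
        c = int(np.argmin(score))
        selected.append(c)
        chosen[c] = True
        W, C = cand_W[c], cand_C[c]
        to_selected += K_tt[c]
    return np.array(selected)


def kmeans_labels(X, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice of clustering method)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(n_clusters):
            if np.any(labels == j):  # leave an empty cluster's centre unchanged
                centers[j] = X[labels == j].mean(0)
    return labels


def batched_predict(train_X, train_y, test_X, predict_fn,
                    n_clusters=2, k=32, gamma=1.0):
    """Cluster the queries, build one matched context per cluster, then call
    the (hypothetical) predictor once per reduced-context batch."""
    labels = kmeans_labels(test_X, n_clusters)
    out = np.empty(len(test_X), dtype=train_y.dtype)
    for j in np.unique(labels):
        mask = labels == j
        idx = greedy_mmd_select(train_X, test_X[mask], k, gamma)
        out[mask] = predict_fn(train_X[idx], train_y[idx], test_X[mask])
    return out
```

In this sketch each forward pass sees at most `k` context rows instead of the full training set, which is where the saving over quadratic attention would come from. The full training-by-training kernel matrix is kept only for brevity; a practical version would need to avoid materialising it for very large training sets.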
Problem

Research questions and friction points this paper is trying to address.

Prior-fitted networks
in-context learning
scalability
self-attention
tabular foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prior-fitted networks
Context batching
Maximum mean discrepancy
In-context learning
Distributional matching