OpenRFM: Dissecting Relational In-Context Learning

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates the substantial performance gap between open-source relational foundation models (RFMs) and their commercial counterparts, tracing it to relation-level label scarcity on the model side and to the choice of pre-training data on the data side. The analysis shows that the Relational Transformer (RT) performs relation-level in-context learning, which breaks down when sparse label-cell coverage leaves the underlying regression underdetermined, and that existing pre-training lacks a support-identifiable relational latent in the label-generation process. To address both issues, the authors propose a dual-stage architecture: a relational backbone first extracts structured representations, and a batch-level in-context learning layer lifted from a pre-trained tabular foundation model then performs inference. They also introduce a homophily-aware pre-training mixture of synthetic data followed by continual real-data pre-training, augmented with prototype-based regularization. The resulting model, OpenRFM, improves average task performance by approximately 30% over the RT backbone and surpasses the commercial model KumoRFMv1 on a large set of evaluation tasks.

📝 Abstract

Relational Foundation Models (RFMs) promise a single pre-trained predictor that, given any relational database, returns predictions in one forward pass via relational in-context learning (ICL). Yet a substantial gap separates open RFMs from their commercial counterparts, and the origin of this gap has not been systematically understood. We dissect a representative framework, the Relational Transformer (RT), from two perspectives. Model side: we show that RT performs relation-level ICL, and a kernel regression view shows it fails when sparse label-cell coverage yields an underdetermined regression. Data side: we ablate RT's pre-training source and find that existing synthetic-only pre-training and in-distribution pre-training drive the same architecture into different regimes, lazy vs. feature-learning. Probing this gap reveals that the missing ingredient is a support-identifiable relational latent in the label-generation process. These two diagnoses translate into (1) a dual-stage ICL architecture that combines the relational backbone with a batch-level ICL layer lifted from a pre-trained tabular foundation model to overcome relation-level label scarcity, and (2) a homophily-aware synthetic plus continual real-data pre-training mixture, augmented with a prototype-based regularization. These choices define OpenRFM, a simple yet effective RFM that improves average task performance by approximately 30% over the RT backbone and surpasses the commercial model KumoRFMv1 on a large set of evaluation tasks.
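The kernel regression view mentioned in the abstract can be illustrated with a small toy example. The sketch below is not the paper's analysis: it assumes a linear label-generation process, random Gaussian features, and a Gaussian-kernel (Nadaraya-Watson) predictor as a stand-in for attention over labelled context cells. All function and variable names are hypothetical. It shows only the general point that when the number of labelled cells is smaller than the number of degrees of freedom, many solutions fit the labels equally well.

```python
import numpy as np

def kernel_regression_predict(query, support_x, support_y, bandwidth=1.0):
    """Nadaraya-Watson estimate: kernel-weighted average of the support labels."""
    # Squared Euclidean distance from the query to each labelled support point.
    sq_dist = np.sum((support_x - query) ** 2, axis=1)
    logits = -sq_dist / (2.0 * bandwidth ** 2)
    # Softmax over the support set (shifted by the max for numerical stability).
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return float(weights @ support_y)

def label_system_rank(support_x):
    """Rank of the linear system that the labelled points impose on the weights."""
    return int(np.linalg.matrix_rank(support_x))

rng = np.random.default_rng(0)
n_features = 8
true_w = rng.normal(size=n_features)      # assumed linear label-generation process

dense_x = rng.normal(size=(64, n_features))
dense_y = dense_x @ true_w
sparse_x, sparse_y = dense_x[:3], dense_y[:3]   # only three labelled cells

# Least-squares fits: lstsq returns the minimum-norm solution when rank < n_features.
w_dense = np.linalg.lstsq(dense_x, dense_y, rcond=None)[0]
w_sparse = np.linalg.lstsq(sparse_x, sparse_y, rcond=None)[0]

print("rank with dense labels: ", label_system_rank(dense_x))
print("rank with sparse labels:", label_system_rank(sparse_x))
print("dense fit recovers true_w: ", np.allclose(w_dense, true_w))
print("sparse fit recovers true_w:", np.allclose(w_sparse, true_w))

# With a single labelled point, the kernel estimate ignores the query entirely.
single = kernel_regression_predict(dense_x[5], dense_x[:1], dense_y[:1])
print("single-label prediction equals that label:", single == float(dense_y[0]))
```

In this toy setting the sparse fit reproduces its three labels exactly yet does not recover the generating weights, which is the sense in which an underdetermined regression gives the predictor too little to work with. The batch-level ICL layer described in the abstract is aimed at this label scarcity. How it does so is not modelled here.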

Problem

Research questions and friction points this paper is trying to address.

Relational Foundation Models

relational in-context learning

performance gap

pre-training

label scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

relational in-context learning

dual-stage ICL architecture

homophily-aware pre-training