🤖 AI Summary
This work addresses a key limitation of existing structured-data modeling approaches, namely their reliance on task-specific architectures and per-task fine-tuning, by proposing LimiX, a foundation model for general-purpose intelligence over structured data. LimiX models structured data as a joint distribution over variables and missingness patterns, enabling unified conditional prediction across diverse tasks (classification, regression, imputation, and synthetic data generation) via query-based inference. Its key innovations are (i) masked joint-distribution pretraining and (ii) a context-aware conditional prediction mechanism, which together support training-free adaptation and rapid cross-dataset generalization. Evaluated on ten large-scale benchmarks, LimiX consistently outperforms gradient-boosted trees, deep tabular models, state-of-the-art tabular foundation models, and AutoML systems. The authors position it as the first model to achieve universal, zero-shot, multi-task inference over structured data with a single architecture and no task-specific fine-tuning.
📝 Abstract
We argue that progress toward general intelligence requires complementary foundation models grounded in language, the physical world, and structured data. This report presents LimiX, the first installment of our large structured-data models (LDMs). LimiX treats structured data as a joint distribution over variables and missingness, and can therefore address a wide range of tabular tasks through query-based conditional prediction with a single model. LimiX is pretrained using masked joint-distribution modeling with an episodic, context-conditional objective: the model predicts for query subsets conditioned on dataset-specific contexts, which supports rapid, training-free adaptation at inference. We evaluate LimiX across 10 large structured-data benchmarks spanning broad regimes of sample size, feature dimensionality, class count, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratio. With a single model and a unified interface, LimiX consistently surpasses strong baselines, including gradient-boosted trees, deep tabular networks, recent tabular foundation models, and automated ensembles, as shown in Figure 1 and Figure 2. This advantage holds across a wide range of tasks, such as classification, regression, missing-value imputation, and data generation, often by substantial margins, while avoiding task-specific architectures and bespoke per-task training. All LimiX models are publicly available under the Apache 2.0 license.
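To make the episodic, context-conditional objective concrete, the sketch below shows one way such a pretraining episode could be structured: a table is split into fully observed context rows and query rows, a random subset of query cells is masked, and the model must predict the masked cells conditioned on the context. This is an illustrative sketch only; the function name `make_episode`, the mean-imputation stand-in for the model, and all shapes are assumptions, not LimiX's actual API or training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_episode(table, n_context, mask_rate=0.3, rng=rng):
    """Build one hypothetical pretraining episode: split rows into a
    context set and a query set, then mask random query cells that the
    model must reconstruct conditioned on the context."""
    idx = rng.permutation(len(table))
    context = table[idx[:n_context]]            # fully observed rows
    query = table[idx[n_context:]].copy()       # held-out target rows
    mask = rng.random(query.shape) < mask_rate  # cells to predict
    masked_query = np.where(mask, np.nan, query)
    return context, masked_query, query, mask

table = rng.normal(size=(10, 4))
ctx, masked_q, target_q, mask = make_episode(table, n_context=6)

# Stand-in "model": impute masked cells with context column means.
# A real model would condition on the context set far more richly.
preds = np.where(mask, np.nanmean(ctx, axis=0), masked_q)
```

At inference time the same interface covers classification, regression, and imputation alike: the task determines which cells are masked, and the downstream dataset's labeled rows play the role of the context set, which is why no task-specific fine-tuning is needed.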