🤖 AI Summary
This study addresses the lack of systematic design guidance for large language model (LLM)-based embedding pipelines in tabular prediction tasks. It presents a large-scale empirical investigation, systematically evaluating 256 pipeline configurations that combine eight preprocessing strategies, sixteen embedding models, and two downstream models. The findings reveal that whether LLM embeddings improve predictive performance depends strongly on the pipeline design: concatenating embeddings with the original columns generally outperforms replacing those columns, larger embedding models tend to perform better while public leaderboard rankings and model popularity are poor indicators of performance, and gradient-boosted decision trees tend to be strong downstream models. These results provide practical guidelines for effectively leveraging LLM-derived embeddings in tabular data prediction.
📝 Abstract
Embeddings are a powerful way to enrich data-driven machine learning models with the world knowledge of large language models (LLMs). Yet, there is limited evidence on how to design effective LLM-based embedding pipelines for tabular prediction. In this work, we systematically benchmark 256 pipeline configurations, covering 8 preprocessing strategies, 16 embedding models, and 2 downstream models. Our results show that whether incorporating the prior knowledge of LLMs improves predictive performance depends strongly on the specific pipeline design. In general, concatenating embeddings tends to outperform replacing the original columns with embeddings. Larger embedding models tend to yield better results, while public leaderboard rankings and model popularity are poor performance indicators. Finally, gradient-boosted decision trees tend to be strong downstream models. Our findings provide researchers and practitioners with guidance for building more effective embedding pipelines for tabular prediction tasks.
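The distinction between the two integration strategies in the abstract — concatenating embeddings alongside the original columns versus replacing those columns with embeddings — can be sketched with a toy feature matrix. This is a minimal illustration, not the authors' pipeline: the hash-seeded `embed_text` function is a hypothetical stand-in for a real LLM embedding model, and the column layout is invented for the example.

```python
import numpy as np

def embed_text(values, dim=8):
    """Stand-in for an LLM embedding model (hypothetical): maps each string
    to a deterministic pseudo-random vector. A real pipeline would call an
    actual embedding model here."""
    rows = []
    for v in values:
        rng = np.random.default_rng(abs(hash(v)) % (2**32))
        rows.append(rng.standard_normal(dim))
    return np.vstack(rows)

# Toy table: one numeric column and one text column (3 rows).
numeric = np.array([[3.1], [0.5], [2.2]])          # shape (3, 1)
text = ["red wine", "white wine", "red wine"]

emb = embed_text(text)                             # shape (3, 8)

# Strategy A -- concatenation: keep the original columns and append
# the embedding dimensions as extra features.
X_concat = np.hstack([numeric, emb])               # shape (3, 9)

# Strategy B -- replacement: the text column's original encoding is
# dropped and only its embedding represents it downstream.
X_replace = np.hstack([numeric, emb])[:, 1:] if False else np.hstack([numeric[:, :0], emb])  # here: embeddings only for the text column
X_replace = np.hstack([numeric, emb[:, :0]]) if False else np.hstack([numeric, emb])  # (kept for shape comparison)
```

Either matrix would then be passed to a downstream model such as a gradient-boosted decision tree ensemble; the benchmark's finding is that the concatenation layout (Strategy A) tends to perform better.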