Just Because You Can, Doesn't Mean You Should: LLMs for Data Fitting

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes a critical robustness deficiency in large language models (LLMs) used as data-fitting tools: they are extremely sensitive to superficial, task-irrelevant changes in data representation, such as variable renaming, which can shift prediction error by as much as 82%. The study demonstrates this sensitivity under both in-context learning and supervised fine-tuning, and uses attention visualization on an open-weight LLM to reveal a non-uniform attention pattern over serialized tabular data that partially explains it. Comparative experiments across closed- and open-weight general-purpose LLMs, together with a specialized tabular foundation model (TabPFN), confirm that this fragility is widespread. The core contributions are threefold: (1) a robustness evaluation of LLM-based data fitting centered on task-irrelevant representation changes; (2) a mechanistic account of the sensitivity rooted in attention behavior; and (3) empirical evidence that even models explicitly designed for prediction robustness, such as TabPFN, are not immune.

📝 Abstract
Large Language Models (LLMs) are being applied in a wide array of settings, well beyond the typical language-oriented use cases. In particular, LLMs are increasingly used as a plug-and-play method for fitting data and generating predictions. Prior work has shown that LLMs, via in-context learning or supervised fine-tuning, can perform competitively with many tabular supervised learning techniques in terms of predictive performance. However, we identify a critical vulnerability of using LLMs for data fitting -- making changes to data representation that are completely irrelevant to the underlying learning task can drastically alter LLMs' predictions on the same data. For example, simply changing variable names can sway the size of prediction error by as much as 82% in certain settings. Such prediction sensitivity with respect to task-irrelevant variations manifests under both in-context learning and supervised fine-tuning, for both closed-weight and open-weight general-purpose LLMs. Moreover, by examining the attention scores of an open-weight LLM, we discover a non-uniform attention pattern: training examples and variable names/values which happen to occupy certain positions in the prompt receive more attention when output tokens are generated, even though different positions are expected to receive roughly the same attention. This partially explains the sensitivity in the presence of task-irrelevant variations. We also consider a state-of-the-art tabular foundation model (TabPFN) trained specifically for data fitting. Despite being explicitly designed to achieve prediction robustness, TabPFN is still not immune to task-irrelevant variations. Overall, despite LLMs' impressive predictive capabilities, currently they lack even the basic level of robustness to be used as a principled data-fitting tool.
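The vulnerability the abstract describes can be made concrete with a small sketch. The helpers below serialize tabular rows into a text prompt (a common setup for LLM-based data fitting), apply a task-irrelevant variable renaming, and compute the relative swing in prediction error between two runs. The function names and the serialization format are illustrative assumptions, not the paper's exact protocol.

```python
def serialize_row(row: dict, target_name: str) -> str:
    """Turn one tabular example into a prompt line, e.g.
    'x1 = 3.2, x2 = 0.7 -> y = 1.5'."""
    features = ", ".join(f"{k} = {v}" for k, v in row.items() if k != target_name)
    return f"{features} -> {target_name} = {row[target_name]}"

def rename_variables(row: dict, mapping: dict) -> dict:
    """Apply a task-irrelevant renaming (e.g. 'x1' -> 'temperature').
    The underlying data is unchanged; only the surface form differs."""
    return {mapping.get(k, k): v for k, v in row.items()}

def relative_error_change(rmse_original: float, rmse_renamed: float) -> float:
    """Size of the prediction-error swing caused by renaming alone.
    The paper reports swings as large as 82% in some settings."""
    return abs(rmse_renamed - rmse_original) / rmse_original

row = {"x1": 3.2, "x2": 0.7, "y": 1.5}
renamed = rename_variables(row, {"x1": "temperature", "x2": "humidity", "y": "yield"})
print(serialize_row(row, "y"))          # x1 = 3.2, x2 = 0.7 -> y = 1.5
print(serialize_row(renamed, "yield"))  # same data, different surface form
print(round(relative_error_change(1.00, 1.82), 2))
```

A robust data-fitting model should return (near-)identical predictions for both serializations, since the renaming carries no information about the learning task; the paper's finding is that general-purpose LLMs do not.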
Problem

Research questions and friction points this paper is trying to address.

- LLMs show prediction sensitivity to task-irrelevant data variations
- Task-irrelevant changes, such as variable renaming, alter prediction accuracy
- LLMs lack the robustness needed for principled data-fitting applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Identifies LLM prediction sensitivity to data representation
- Discovers non-uniform attention patterns over serialized tabular prompts
- Tests the robustness of a specialized tabular foundation model (TabPFN)
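The non-uniform attention finding can be illustrated with a toy check: given the attention mass each in-context example receives when output tokens are generated, compare it to the roughly uniform distribution one would expect if all examples were treated alike. The metric below (total variation distance from uniform) is an illustrative choice, not the paper's exact diagnostic.

```python
def tv_distance_from_uniform(attention: list[float]) -> float:
    """Total variation distance between a normalized attention
    distribution and the uniform distribution over the same positions.
    0.0 means perfectly uniform; larger values mean some prompt
    positions are favored when output tokens are generated."""
    total = sum(attention)
    n = len(attention)
    return 0.5 * sum(abs(a / total - 1 / n) for a in attention)

# Four in-context examples receiving equal vs. skewed attention mass.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.55, 0.15, 0.15, 0.15]  # e.g. the first example dominates

print(tv_distance_from_uniform(uniform))  # 0.0
print(round(tv_distance_from_uniform(skewed), 2))
```

Under this view, a large distance signals that the model's prediction depends on where an example happens to sit in the prompt, which is one way position-dependent attention can translate into sensitivity to task-irrelevant representation changes.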