Predicting Language Models' Success at Zero-Shot Probabilistic Prediction

📅 2025-09-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In zero-shot probabilistic prediction on tabular data, users lack a reliable way to anticipate how well large language models (LLMs) will perform on a specific task without ground-truth labels. Method: We propose a task-level performance prediction framework that requires no labeled data. It extracts unsupervised meta-features from the model's own outputs—such as predicted probability distributions and confidence consistency—and combines large-scale empirical analysis with task-level meta-modeling to quantify LLMs' zero-shot predictive capability. Contribution/Results: First, we systematically document the high variability of LLM performance across tabular prediction tasks. Second, we introduce generalizable, plug-and-play unsupervised metrics that reliably predict accuracy on unseen tasks (Pearson correlation ≥ 0.72). Third, we show that when the LLM performs well on the base prediction task, its raw predicted probabilities become a strong signal of individual-level accuracy—providing a principled basis for assessing model suitability in zero-shot settings.
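To make the meta-feature idea concrete, here is a minimal sketch of how such unsupervised signals could be computed from an LLM's predicted class probabilities. The function name `confidence_meta_features` and the specific features (mean top-class probability, predictive entropy, confidence spread) are illustrative assumptions, not the paper's exact metric set.

```python
import numpy as np

def confidence_meta_features(probs: np.ndarray) -> dict:
    """Unsupervised task-level meta-features from predicted class
    probabilities (shape: n_examples x n_classes); no labels needed."""
    max_prob = probs.max(axis=1)  # per-example top-class confidence
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # predictive entropy
    return {
        "mean_confidence": float(max_prob.mean()),  # average confidence
        "mean_entropy": float(entropy.mean()),      # average uncertainty
        "confidence_std": float(max_prob.std()),    # consistency of confidence
    }

# Example: three predictions on a binary task
print(confidence_meta_features(np.array([[0.9, 0.1], [0.6, 0.4], [0.8, 0.2]])))
```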

📝 Abstract
Recent work has investigated the capabilities of large language models (LLMs) as zero-shot models for generating individual-level characteristics (e.g., to serve as risk models or augment survey datasets). However, when should a user have confidence that an LLM will provide high-quality predictions for their particular task? To address this question, we conduct a large-scale empirical study of LLMs' zero-shot predictive capabilities across a wide range of tabular prediction tasks. We find that LLMs' performance is highly variable, both on tasks within the same dataset and across different datasets. However, when the LLM performs well on the base prediction task, its predicted probabilities become a stronger signal for individual-level accuracy. Then, we construct metrics to predict LLMs' performance at the task level, aiming to distinguish between tasks where LLMs may perform well and where they are likely unsuitable. We find that some of these metrics, each of which is assessed without labeled data, yield strong signals of LLMs' predictive performance on new tasks.
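The abstract's final step, constructing task-level metrics that predict performance without labels, implies a simple validation recipe: fit a meta-model from per-task unsupervised features to observed accuracy on a development set of tasks, then measure how well it ranks held-out tasks. The sketch below illustrates this with synthetic stand-in data; the linear meta-model, feature count, and split are assumptions for illustration, not the paper's actual setup.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: one row of unsupervised meta-features per task,
# plus observed zero-shot accuracy on tasks where labels are available.
n_tasks, n_features = 40, 3
task_features = rng.uniform(size=(n_tasks, n_features))
task_accuracy = np.clip(
    0.5 + 0.4 * task_features[:, 0] + 0.05 * rng.normal(size=n_tasks), 0.0, 1.0
)

# Fit a task-level meta-model on half the tasks, predict on the rest.
train, test = np.arange(0, 20), np.arange(20, 40)
meta_model = LinearRegression().fit(task_features[train], task_accuracy[train])
predicted = meta_model.predict(task_features[test])

# How well do the unsupervised features rank unseen tasks?
r, _ = pearsonr(predicted, task_accuracy[test])
print(f"Pearson correlation on held-out tasks: {r:.2f}")
```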
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' zero-shot predictive performance variability
Identifying when LLMs provide high-quality individual predictions
Developing metrics to predict LLM suitability for tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale empirical study across tabular tasks
Constructed metrics to predict task-level performance
Metrics assessed without any labeled data
Kevin Ren
Graduate Student, Princeton University
Harmonic analysis, Metric geometry, Incidence geometry
Santiago Cortes-Gomez
Carnegie Mellon University
Carlos Miguel Patiño
Carnegie Mellon University
Ananya Joshi
Carnegie Mellon University
Ruiqi Lyu
Carnegie Mellon University
Jingjing Tang
Southwestern University of Finance and Economics
machine learning
Alistair Turcan
Carnegie Mellon University
Khurram Yamin
Carnegie Mellon University
Steven Wu
Carnegie Mellon University
Bryan Wilder
Assistant Professor of Machine Learning, Carnegie Mellon University
Artificial intelligence, optimization, machine learning, social networks