Can large language models assist choice modelling? Insights into prompting strategies and current models' capabilities

📅 2025-07-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the underexplored potential of large language models (LLMs) in constructing multinomial logit (MNL) choice models—a core task in behavioral modeling and discrete choice analysis. Method: We systematically evaluate 13 state-of-the-art LLMs (including GPT, Claude, and Llama variants) across diverse data conditions using structured prompting strategies—zero-shot and chain-of-thought—to assess their capabilities in model specification, parameter estimation, and decision support. Contribution/Results: Structured prompting substantially improves output quality and reliability. Certain models (e.g., GPT-o3) autonomously generate and execute estimation code; Claude 4 Sonnet achieves the best fit but with higher computational complexity, while GPT-series models exhibit superior stability. Closed-source models consistently outperform open-source counterparts. Notably, several LLMs infer optimal model specifications directly from data dictionaries—and even demonstrate robust reasoning when original choice data is unavailable. This work establishes the first empirical foundation and methodological framework for leveraging LLMs in behavioral model construction.

📝 Abstract
Large Language Models (LLMs) are widely used to support various workflows across different disciplines, yet their potential in choice modelling remains relatively unexplored. This work examines the potential of LLMs as assistive agents in the specification and, where technically feasible, estimation of Multinomial Logit models. We implement a systematic experimental framework involving thirteen versions of six leading LLMs (ChatGPT, Claude, DeepSeek, Gemini, Gemma, and Llama) evaluated under five experimental configurations. These configurations vary along three dimensions: modelling goal (suggesting vs. suggesting and estimating MNLs); prompting strategy (Zero-Shot vs. Chain-of-Thought); and information availability (full dataset vs. data dictionary only). Each LLM-suggested specification is implemented, estimated, and evaluated based on goodness-of-fit metrics, behavioural plausibility, and model complexity. Findings reveal that proprietary LLMs can generate valid and behaviourally sound utility specifications, particularly when guided by structured prompts. Open-weight models such as Llama and Gemma struggled to produce meaningful specifications. Claude 4 Sonnet consistently produced the best-fitting and most complex models, while GPT models yielded robust and stable modelling outcomes. Some LLMs performed better when provided with just the data dictionary, suggesting that limiting raw data access may enhance internal reasoning capabilities. Among all LLMs, GPT o3 was uniquely capable of correctly estimating its own specifications by executing self-generated code. Overall, the results demonstrate both the promise and current limitations of LLMs as assistive agents in choice modelling, not only for model specification but also for supporting modelling decisions and estimation, and provide practical guidance for integrating these tools into choice modellers' workflows.
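The two prompting strategies the abstract compares can be illustrated as follows. This is a hypothetical sketch of the prompt structure, not the study's actual prompts; the wording and the step decomposition are assumptions made for illustration.

```python
# Illustrative Zero-Shot vs Chain-of-Thought prompt structures for the
# MNL specification task (hypothetical wording, not the paper's prompts).
task = (
    "You are a choice modelling expert. Using the data dictionary below, "
    "suggest a Multinomial Logit utility specification for mode choice."
)

# Zero-Shot: the task is posed directly, with no reasoning scaffold
zero_shot = task

# Chain-of-Thought: the same task plus an explicit step-by-step scaffold
chain_of_thought = task + (
    "\nThink step by step: (1) identify the choice alternatives; "
    "(2) select candidate attributes for each utility function; "
    "(3) decide on alternative-specific constants and whether coefficients "
    "are generic or alternative-specific; (4) state the final specification."
)
```

The study's finding that structured prompting improves output quality corresponds to the difference between these two forms: the scaffold forces the model to externalise intermediate modelling decisions before committing to a specification.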
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' potential in Multinomial Logit model specification and estimation
Evaluating prompting strategies and information availability for LLM performance
Comparing proprietary and open-weight LLMs in choice modelling tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLMs for Multinomial Logit model specification
Evaluating LLMs with structured prompting strategies
LLMs generating and estimating model specifications
Georges Sfeir
Choice Modelling Centre & Institute for Transport Studies, University of Leeds, Leeds, United Kingdom
Gabriel Nova
CityAI Lab, Transport and Logistics group, Engineering Systems and Services Department, TU Delft, Delft, Netherlands
Stephane Hess
Choice Modelling Centre & Institute for Transport Studies, University of Leeds, Leeds, United Kingdom
Sander van Cranenburgh
Associate professor, Delft University of Technology
Choice modelling · Machine Learning · Travel behaviour