Constructing a Portfolio Optimization Benchmark Framework for Evaluating Large Language Models

📅 2026-03-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the lack of systematic evaluation of large language models' (LLMs) reasoning capabilities in quantitative investment decision-making, as existing financial benchmarks primarily focus on language understanding. The authors propose the first scalable multiple-choice benchmark derived from portfolio optimization problems with analytical solutions, grounded in modern portfolio theory. By parameterizing objective functions and constraints, they generate a diverse test set that directly assesses LLMs' ability to reason about financial optimization tasks. Experimental results reveal distinct performance patterns: GPT-4 demonstrates superior and robust performance under risk-oriented objectives, Gemini 1.5 Pro excels in return-focused scenarios but is sensitive to constraint variations, and Llama 3.1-70B exhibits comparatively weaker overall performance. These findings highlight both the disparities and limitations of current LLMs in quantitative financial reasoning.
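The benchmark's core device is pairing each question with a mathematically exact optimum. As a minimal sketch of that idea (the paper's actual asset universes and parameter ranges are not reproduced here), the global minimum-variance portfolio under a full-investment constraint admits the closed-form solution w* = Σ⁻¹1 / (1ᵀΣ⁻¹1):

```python
import numpy as np

# Hypothetical covariance matrix for three assets; these numbers are
# illustrative only, not drawn from the paper.
cov = np.array([
    [0.040, 0.006, 0.010],
    [0.006, 0.090, 0.012],
    [0.010, 0.012, 0.160],
])

# Closed-form global minimum-variance weights under sum(w) = 1:
#   w* = Sigma^{-1} 1 / (1' Sigma^{-1} 1)
ones = np.ones(cov.shape[0])
y = np.linalg.solve(cov, ones)   # Sigma^{-1} 1 without forming an explicit inverse
weights = y / (ones @ y)

print(weights.round(4))
```

Because the optimum is analytical, the correct answer to each generated question can be verified exactly rather than by a numerical solver.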

πŸ“ Abstract
This study introduces a benchmark framework for evaluating the financial decision-making capabilities of large language models (LLMs) through portfolio optimization problems with mathematically explicit solutions. Unlike existing financial benchmarks that emphasize language-processing tasks, the proposed framework directly tests optimization-based reasoning in investment contexts. A large set of multiple-choice questions is generated by varying objectives, candidate assets, and investment constraints, with each problem designed to include a unique correct solution and systematically constructed alternatives. Experimental results comparing GPT-4, Gemini 1.5 Pro, and Llama 3.1-70B reveal distinct performance patterns: GPT-4 achieves the highest accuracy on risk-based objectives and remains stable under constraints, Gemini 1.5 Pro performs well on return-based tasks but struggles under other conditions, and Llama 3.1-70B records the lowest overall performance. These findings highlight both the potential and current limitations of LLMs in applying quantitative reasoning to finance, while providing a scalable foundation for developing LLM-based services in portfolio management.
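The question-generation procedure described above can be sketched as follows. The distractor scheme (renormalized random perturbations of the true weights) and the four-option format are assumptions for illustration; the paper's exact construction of alternatives may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_question(cov, n_options=4):
    """Build one multiple-choice item around the minimum-variance portfolio.

    Returns (options, answer_index), where exactly one option is the
    analytically correct weight vector. The perturbation-based distractors
    are an illustrative assumption, not the paper's published scheme.
    """
    n = cov.shape[0]
    ones = np.ones(n)
    w = np.linalg.solve(cov, ones)
    w /= w.sum()                          # unique analytical optimum
    options = [w]
    while len(options) < n_options:
        noisy = np.abs(w + rng.normal(scale=0.1, size=n))
        noisy /= noisy.sum()              # distractors still sum to one
        if all(np.linalg.norm(noisy - o) > 1e-3 for o in options):
            options.append(noisy)
    order = rng.permutation(n_options)    # shuffle so the answer moves around
    answer = int(np.where(order == 0)[0][0])
    return [options[i] for i in order], answer
```

Varying the covariance input (and, analogously, the objective and constraints) scales the test set while keeping each item's ground truth verifiable in closed form.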
Problem

Research questions and friction points this paper aims to address.

portfolio optimization
large language models
financial decision-making
benchmark framework
quantitative reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

portfolio optimization
large language models
financial benchmarking
quantitative reasoning
constrained optimization
Hanyong Cho
Graduate School of Management of Technology, Korea University
Jang Ho Kim
Graduate School of Management of Technology, Korea University