🤖 AI Summary
Current large language models (LLMs) lack the precise symbolic reasoning that spreadsheet tasks demand, and are prone to "hallucinated" errors in multi-step formula generation and structured-data manipulation. To address this gap, we propose FLARE, the first comprehensive spreadsheet-oriented benchmark designed to evaluate rigorous logical reasoning in realistic office scenarios. FLARE comprises three task categories: formula generation, data manipulation, and logical auditing. It combines synthetically constructed and real-world spreadsheet cases into a hierarchical task suite and introduces a dual-verification mechanism to ensure correctness and robustness. Experimental results show that while state-of-the-art LLMs achieve strong performance on simple formula tasks, their accuracy degrades substantially on complex, multi-step reasoning tasks, revealing critical deficiencies in structured-data reasoning. This work establishes a new evaluation paradigm and provides a foundational benchmark for advancing the reliability and trustworthiness of LLMs in spreadsheet and structured-data applications.
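To make the dual-verification idea concrete, here is a minimal, hypothetical sketch of how a formula-generation item might be scored: a structural check (the candidate formula parses into an expected shape) paired with a behavioral check (evaluating the formula on the item's data reproduces the gold answer). All names and the tiny `=SUM(range)` formula subset are illustrative assumptions, not the benchmark's actual API.

```python
import re

def eval_sum_formula(formula, column):
    """Evaluate a formula of the form =SUM(A<i>:A<j>) over one column.

    Rows are 1-indexed as in a spreadsheet; returns None when the
    formula is outside the supported subset (structural failure).
    """
    m = re.fullmatch(r"=SUM\(A(\d+):A(\d+)\)",
                     formula.replace(" ", ""), re.IGNORECASE)
    if not m:
        return None
    start, end = int(m.group(1)), int(m.group(2))
    return sum(column[start - 1:end])

def dual_verify(candidate, column, gold):
    value = eval_sum_formula(candidate, column)
    structural_ok = value is not None   # formula has the expected shape
    behavioral_ok = value == gold       # formula computes the right answer
    return structural_ok and behavioral_ok

column = [10, 20, 30, 40]                       # values in A1:A4
print(dual_verify("=SUM(A1:A4)", column, 100))  # True
print(dual_verify("=SUM(A1:A3)", column, 100))  # False: plausible but wrong range
```

The second call illustrates the failure mode the summary highlights: a formula that is syntactically valid and plausible, yet logically wrong, passes the structural check but is caught by the behavioral one.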
📝 Abstract
Large Language Models (LLMs) have demonstrated significant capabilities across various domains; however, their effectiveness on spreadsheet-related tasks remains underexplored. This study lays the foundation for a comprehensive benchmark framework to evaluate the performance of leading LLMs on spreadsheet functions, formula generation, and data manipulation tasks. The benchmark spans tasks ranging from basic formula creation to complex, real-world spreadsheet scenarios. Our findings reveal that while LLMs are proficient at straightforward tasks, they often falter on complex, multi-step operations, frequently producing plausible yet incorrect outputs. These results underscore the limitations of current LLMs on spreadsheet tasks that require precise logical reasoning and highlight the need to integrate symbolic reasoning capabilities into LLM architectures. To support this, we introduce FLARE (Formula Logic, Auditing, Reasoning, and Evaluation), a new benchmark for evaluating LLM performance on real-world spreadsheet logic, auditing, and reasoning tasks.