Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

📅 2026-04-08

🤖 AI Summary

Existing algebraic reasoning benchmarks struggle to precisely attribute the failure modes of large language models, as accuracy alone cannot disentangle the effects of specific complexity dimensions. This work proposes the first parameterized framework for generating and verifying algebraic problems with independently controllable complexity across nine dimensions, enabling systematic model evaluation under strict variable isolation. Evaluating seven models ranging from 8B to 235B parameters reveals a universal breakdown when handling 20–30 parallel reasoning branches, exposing an architectural bottleneck. Furthermore, the study identifies a minimal diagnostic set of five core complexity dimensions sufficient to encompass all known failure patterns and demonstrates that working memory capacity—not model scale—is the critical limiting factor in algebraic reasoning performance.

📝 Abstract

Algebraic reasoning remains one of the most informative stress tests for large language models, yet current benchmarks provide no mechanism for attributing failure to a specific cause. When a model fails an algebraic problem, a single accuracy score cannot reveal whether the expression was too deeply nested, the operator too uncommon, the intermediate state count too high, or the dependency chain too long. Prior work has studied individual failure modes in isolation, but no framework has varied each complexity factor independently under strict experimental control, and no prior system has offered automatic generation and verification of problems of increasing complexity to track model progress over time. We introduce a nine-dimension algebraic complexity framework in which each factor is varied independently while all others are held fixed, with problem generation and verification handled by a parametric pipeline requiring no human annotation. Each dimension is grounded in a documented LLM failure mode and captures a structurally distinct aspect of algebraic difficulty, including expression nesting depth, simultaneous intermediate result count, sub-expression complexity, operator hardness, and dependent reasoning chain length. We evaluate seven instruction-tuned models spanning 8B to 235B parameters across all nine dimensions and find that working memory is the dominant scale-invariant bottleneck. Every model collapses between 20 and 30 parallel branches regardless of parameter count, pointing to a hard architectural constraint rather than a capacity limitation that additional scale could resolve. Our analysis further identifies a minimal yet diagnostically sufficient subset of five dimensions that together span the full space of documented algebraic failure modes, providing a complete complexity profile of a model's algebraic reasoning capacity.
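To make the idea of varying one complexity factor while holding the others fixed more concrete, here is a minimal sketch in Python. It is not the paper's pipeline: the names `Complexity`, `gen_expr`, `gen_problem`, and `verify` are hypothetical, only two of the nine dimensions (nesting depth and parallel branch count) are modeled, and the choice of operators and operand range is an assumption made for illustration.

```python
import random
from dataclasses import dataclass

# Hypothetical configuration: two illustrative dimensions, not the paper's nine.
@dataclass(frozen=True)
class Complexity:
    nesting_depth: int = 1       # parentheses depth of each branch
    parallel_branches: int = 1   # independent intermediate results to combine
    max_operand: int = 9         # assumed operand range, fixed across sweeps

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def gen_expr(depth, rng, max_operand):
    """Return (text, value) for a left-nested expression of exactly `depth` levels."""
    if depth == 0:
        value = rng.randint(1, max_operand)
        return str(value), value
    op = rng.choice(sorted(OPS))
    left_text, left_value = gen_expr(depth - 1, rng, max_operand)
    right_text, right_value = gen_expr(0, rng, max_operand)
    return f"({left_text} {op} {right_text})", OPS[op](left_value, right_value)

def gen_problem(cfg, seed):
    """Generate one problem and its ground-truth answer from a config and a seed."""
    rng = random.Random(seed)
    branches = [
        gen_expr(cfg.nesting_depth, rng, cfg.max_operand)
        for _ in range(cfg.parallel_branches)
    ]
    text = " + ".join(t for t, _ in branches)
    answer = sum(v for _, v in branches)
    return text, answer

def verify(text, answer):
    """Re-evaluate the generated text independently of the generator's bookkeeping.

    eval is tolerable here only because `text` is produced by gen_problem and
    contains nothing but digits, spaces, parentheses, and + - * operators.
    """
    return eval(text, {"__builtins__": {}}, {}) == answer

if __name__ == "__main__":
    # Sweep one dimension (parallel branches) while nesting depth stays fixed.
    for branches in (1, 5, 20, 30):
        cfg = Complexity(nesting_depth=2, parallel_branches=branches)
        text, answer = gen_problem(cfg, seed=0)
        print(branches, verify(text, answer), text[:60])
```

Because the answer is computed alongside the expression and then re-checked by `verify`, no human annotation is needed in this sketch, and sweeping `parallel_branches` with `nesting_depth` held constant mirrors the variable-isolation design described above in a simplified form.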

Problem

Research questions and friction points this paper is trying to address.

algebraic reasoning

complexity dimensions

failure attribution

large language models

working memory bottleneck

Innovation

Methods, ideas, or system contributions that make the work stand out.

algebraic reasoning

complexity dimensions

failure diagnosis