🤖 AI Summary
This work evaluates the reliability of large language models (LLMs) on a foundational pedagogical task in operations research: converting linear programming primal problems to their duals (P2DC). To address the high false-positive and false-negative rates of existing evaluation methods, which lack formal verification, we propose an automatic verification approach based on *canonical graph edit distance* (CGED), integrated into a comprehensive assessment framework that supports instance generation, rigorous correctness validation, and fine-grained error attribution. Our systematic evaluation reveals that even state-of-the-art open-weight LLMs commit frequent, fundamental errors on minimal two-variable instances, and that they are similarly fragile on derivative tasks such as correctness verification and error classification. These findings challenge the prevailing assumption that LLMs are reliable optimization teaching aids, providing both a formal verification procedure and empirical benchmarks for trustworthy LLM deployment in educational contexts.
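The verification idea can be sketched as follows: encode each linear program as a labeled graph and compare a candidate dual against a reference dual by graph edit distance, with distance zero indicating a match. The snippet below is a minimal illustration of that encoding, not DualSchool's implementation: it omits the canonicalization step, and the `lp_to_graph` and `cged` helpers are hypothetical names built on networkx's generic `graph_edit_distance`.

```python
import networkx as nx

def lp_to_graph(c, A, b, senses):
    """Encode an LP with objective coefficients c, constraint matrix A,
    right-hand sides b, and constraint senses as a labeled bipartite graph:
    one node per variable, one node per constraint, one edge per nonzero
    coefficient."""
    g = nx.Graph()
    for j, cj in enumerate(c):
        g.add_node(("var", j), obj=cj)               # variable node, labeled by its objective coefficient
    for i, (row, bi, sense) in enumerate(zip(A, b, senses)):
        g.add_node(("con", i), rhs=bi, sense=sense)  # constraint node, labeled by rhs and sense
        for j, aij in enumerate(row):
            if aij != 0:
                g.add_edge(("con", i), ("var", j), coef=aij)
    return g

def cged(g1, g2):
    """Graph edit distance requiring node and edge labels to match exactly;
    0.0 means the two models are identical up to reordering of rows/columns."""
    return nx.graph_edit_distance(
        g1, g2,
        node_match=lambda u, v: u == v,   # compare node attribute dicts (obj / rhs / sense)
        edge_match=lambda e, f: e == f,   # compare coefficient labels
    )

# Reference dual of  max 3x1 + 5x2  s.t.  x1 + 2x2 <= 4,  3x1 + x2 <= 6,  x >= 0
# is  min 4y1 + 6y2  s.t.  y1 + 3y2 >= 3,  2y1 + y2 >= 5,  y >= 0.
reference = lp_to_graph([4, 6], [[1, 3], [2, 1]], [3, 5], [">=", ">="])
candidate = lp_to_graph([4, 6], [[1, 3], [2, 1]], [3, 5], [">=", ">="])  # an LLM's answer, parsed
print(cged(candidate, reference))  # 0.0 -> the candidate matches the reference dual
```

A plain string comparison would report a mismatch whenever constraints are reordered or variables renamed; the graph encoding lets a checker accept equivalent formulations while still rejecting a flipped inequality or a wrong coefficient.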
📝 Abstract
Consider the following task, taught in introductory optimization courses, which addresses challenges articulated by the community at the intersection of (generative) AI and OR: generate the dual of a linear program. LLMs, being trained at web scale, have the conversion process and many instances of Primal-to-Dual Conversion (P2DC) at their disposal. Students may thus reasonably expect that LLMs would perform well on the P2DC task. To assess this expectation, this paper introduces DualSchool, a comprehensive framework for generating and verifying P2DC instances. The verification procedure of DualSchool uses the Canonical Graph Edit Distance, going well beyond existing evaluation methods for optimization models, which exhibit many false positives and false negatives when applied to P2DC. Experiments performed with DualSchool reveal interesting findings. Although LLMs can recite the conversion procedure accurately, state-of-the-art open LLMs fail to consistently produce correct duals. This finding holds even for the smallest two-variable instances and for derivative tasks, such as correctness verification and error classification. The paper also discusses the implications for educators, students, and the development of large reasoning systems.
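To make the task concrete, here is a minimal two-variable P2DC instance of the kind discussed above; the specific numbers are illustrative, not drawn from the benchmark. For a primal $\max\{c^\top x : Ax \le b,\ x \ge 0\}$, the dual is $\min\{b^\top y : A^\top y \ge c,\ y \ge 0\}$:

```latex
\begin{align*}
\text{(Primal)}\quad \max\ & 3x_1 + 5x_2 &\qquad \text{(Dual)}\quad \min\ & 4y_1 + 6y_2 \\
\text{s.t.}\ & x_1 + 2x_2 \le 4 & \text{s.t.}\ & y_1 + 3y_2 \ge 3 \\
& 3x_1 + x_2 \le 6 & & 2y_1 + y_2 \ge 5 \\
& x_1,\, x_2 \ge 0 & & y_1,\, y_2 \ge 0
\end{align*}
```

Each primal constraint yields one dual variable ($y_1$, $y_2$) and each primal variable yields one dual constraint; mistakes in this bookkeeping (failing to transpose $A$, swapping $b$ and $c$, or flipping an inequality) are exactly the kind of errors a formal verifier must catch.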