🤖 AI Summary
This study addresses the reliability of open-weight large language models (LLMs) in power-system code generation, where first-pass failures are dominated by API-knowledge boundary errors such as hallucinated function names, misused parameters, and mishandled result tables, which hinders on-premise deployment. To address this without fine-tuning or cloud inference, the authors combine an execution-validated benchmark generator, PowerCodeBench, with a documentation-driven probing procedure (L0–L3) that measures per-model API knowledge, and a boundary-aware intervention that pairs targeted proactive documentation injection with routed reactive correction. The intervention improves the accuracy of every evaluated open-weight model of at least 7B parameters and every commercial API by 32 to 56 points. Open-weight models in the 70B–120B range match the accuracy range of mid-tier commercial APIs, while Llama-3.1-405B and Qwen3-Coder-480B lead the panel. The targeted prompts preserve the full-context accuracy ceiling while using 41% of the prompt-token cost.
📝 Abstract
Large language models (LLMs) are increasingly used to automate power-system analysis, but many utilities and energy-research labs require on-premise serving for confidentiality, regulatory, reproducibility, and cost reasons. This makes the reliability of open-weight models a deployment issue. We show that first-pass failures in power-system code generation are dominated not by reasoning alone, but by structured API-knowledge boundary errors: hallucinated function names, misused parameters, and mishandled result tables in versioned simulation libraries.
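To make the first error class concrete, the sketch below flags calls to function names that do not exist in a library's API. It is only an illustration, not the paper's method: the allowlist is a small hand-picked subset of real pandapower functions, `find_unknown_calls` is a hypothetical helper, and `run_power_flow` is a made-up name standing in for a hallucinated call.

```python
import ast

# Illustrative allowlist only: a handful of top-level pandapower functions.
# A real check would derive this from the installed library version.
KNOWN_API = {
    "create_empty_network", "create_bus", "create_line",
    "create_load", "create_ext_grid", "runpp",
}

def find_unknown_calls(code, alias="pp", known=KNOWN_API):
    """Return names called as `<alias>.<name>(...)` that are not in `known`."""
    unknown = set()
    for node in ast.walk(ast.parse(code)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == alias
            and node.func.attr not in known
        ):
            unknown.add(node.func.attr)
    return sorted(unknown)

# Generated code is parsed, never executed, so pandapower need not be installed.
generated = """
import pandapower as pp
net = pp.create_empty_network()
bus = pp.create_bus(net, vn_kv=20.0)
pp.run_power_flow(net)  # made-up name; the real call is pp.runpp(net)
"""

print(find_unknown_calls(generated))
```

A static check like this catches only invented names. Misused parameters and mishandled result tables generally surface only when the code runs, which is why execution-based validation matters.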
We introduce PowerCodeBench, an execution-validated benchmark generator that pairs natural-language operator queries with pandapower code and numerical ground truth; an L0-L3 documentation-driven probing procedure that measures per-model API knowledge profiles; and a boundary-aware intervention that combines query-side API demand estimation with targeted proactive documentation injection and routed reactive correction.
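The abstract does not specify how demand estimation or injection is implemented, so the following is a minimal sketch under stated assumptions. A keyword table stands in for query-side API demand estimation, a set of `known_apis` stands in for a model's probed knowledge profile, and the documentation snippets are abbreviated paraphrases. All helper names (`estimate_api_demand`, `build_prompt`) are hypothetical.

```python
# Hypothetical, abbreviated documentation snippets keyed by API name.
DOC_SNIPPETS = {
    "runpp": "pp.runpp(net): run an AC power flow; results land in net.res_* tables.",
    "res_bus": "net.res_bus: per-bus results table with columns such as vm_pu.",
    "res_line": "net.res_line: per-line results table with columns such as loading_percent.",
}

# Assumed keyword -> API mapping standing in for query-side demand estimation.
KEYWORD_TO_APIS = {
    "voltage": ["runpp", "res_bus"],
    "loading": ["runpp", "res_line"],
}

def estimate_api_demand(query):
    """Return the APIs a query appears to need, in first-seen order."""
    text = query.lower()
    needed = []
    for keyword, apis in KEYWORD_TO_APIS.items():
        if keyword in text:
            for api in apis:
                if api not in needed:
                    needed.append(api)
    return needed

def build_prompt(query, known_apis):
    """Inject docs only for needed APIs that the model does not reliably know."""
    missing = [api for api in estimate_api_demand(query) if api not in known_apis]
    if not missing:
        return f"Task: {query}"
    docs = "\n".join(f"- {DOC_SNIPPETS[api]}" for api in missing)
    return f"Relevant documentation:\n{docs}\n\nTask: {query}"

# The model is assumed to know runpp already, so only the res_line entry is injected.
prompt = build_prompt("Which lines exceed 80% loading?", known_apis={"runpp"})
print(prompt)
```

Injecting only the missing entries, rather than the full documentation, is the kind of targeting that keeps prompt-token cost below that of a full-context prompt.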
On a 2,000-task frozen release, we evaluate ten open-weight LLMs (1.5B-480B parameters) and four commercial mid-tier APIs. The intervention improves every evaluated open-weight model of at least 7B parameters and every commercial API by 32 to 56 accuracy points. Open-weight models in the 70B-120B range match the commercial mid-tier accuracy range, while Llama-3.1-405B and Qwen3-Coder-480B lead the panel. The targeted prompts preserve the full-context accuracy ceiling while using 41% of the prompt-token cost. The result is an accuracy-side, deployment-time path toward reliable on-premise LLM assistance for grid-analysis workflows without fine-tuning or cloud inference.