Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

📅 2026-06-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the lack of systematic evaluation of large language models (LLMs) on complex office automation tasks, particularly long-horizon planning, precise parameter configuration, and cross-application coordination. The authors propose the first standardized benchmark derived from China's National Computer Rank Examination (NCRE), comprising 200 hands-on tasks across Word, Excel, and PowerPoint, with an automated scoring system of 7,118 fine-grained criteria. They introduce a Score Rate metric to quantify LLMs' document automation proficiency and implement an end-to-end agent architecture that integrates execution feedback, iterative repair, and cross-Office interoperability. The best-performing agent achieves a Score Rate of 68.8%, substantially below the 95.5% community-reference score, revealing a significant performance gap in fine-grained office automation.
📝 Abstract
The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China's National Computer Rank Examination (NCRE), featuring 200 comprehensive practical-operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric scale using 7,118 machine-gradable criteria, and Score Rate (SR) denotes the mean percentage of rubric points earned across these tasks. We benchmark 7 frontier LLMs and observe stark limitations: single-turn models score a maximum of 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader Office automation access reaches 68.8%, but remains below the 95.5% community-reference score used as a scoring sanity check. Ultimately, our experiments demonstrate that despite recent advancements in code generation, achieving reliable fine-grained Office document automation remains a significant challenge for current code-generating LLM and agent systems.
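The Score Rate described in the abstract (the mean percentage of rubric points earned across tasks) can be illustrated with a minimal sketch. The `Criterion` structure, the function names, and the toy data below are assumptions made for illustration; the paper's actual grader and rubric format are not specified here.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One machine-gradable rubric check (hypothetical structure)."""
    points: float  # rubric points this check is worth
    passed: bool   # whether the automated grader accepted it


def task_score(criteria: list[Criterion]) -> float:
    """Percentage of rubric points earned on a single task (0 to 100)."""
    total = sum(c.points for c in criteria)
    if total == 0:
        raise ValueError("task has no rubric points")
    earned = sum(c.points for c in criteria if c.passed)
    return 100.0 * earned / total


def score_rate(tasks: list[list[Criterion]]) -> float:
    """Mean per-task percentage across all tasks."""
    if not tasks:
        raise ValueError("no tasks to score")
    return sum(task_score(t) for t in tasks) / len(tasks)


# Toy data (invented): one task earning 60 of 100 points, one earning 100 of 100.
demo = [
    [Criterion(60, True), Criterion(40, False)],
    [Criterion(50, True), Criterion(50, True)],
]
print(score_rate(demo))  # 80.0
```

Averaging per-task percentages, rather than pooling all criteria, gives every task equal weight regardless of how many criteria it has, which matches the abstract's description of a 100-point rubric per task.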
Problem

Research questions and friction points this paper is trying to address.

Office automation
Large Language Models
document automation
productivity software
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Office automation benchmark
Large Language Model agents
practical-operation tasks
execution feedback and iterative repair
fine-grained document automation