🤖 AI Summary
Existing evaluations of large language models' (LLMs) engineering capabilities rely heavily on simplified exam-style questions and ad hoc scenarios, failing to reflect the complexity and requirements of real-world engineering tasks. Method: We introduce the first production-oriented engineering benchmark, comprising over 100 authentic industrial problems spanning core competencies such as product design, fault diagnosis, and state prediction. We further propose a multidimensional evaluation framework that systematically assesses state-of-the-art models (GPT-4, Claude, Llama3, and Qwen) on abstract reasoning, formal modeling, and context-sensitive engineering logic. Contribution/Results: Empirical results reveal that while models achieve moderate performance on temporal and structural reasoning, their accuracy on critical engineering capabilities remains below 35%, exposing substantial bottlenecks. This work establishes a rigorous, empirically grounded benchmark for characterizing and advancing LLMs' engineering proficiency in industrial settings.
📝 Abstract
Large Language Models (LLMs) are transformative not only for everyday activities but also for engineering tasks. However, current evaluations of LLMs in engineering exhibit two critical shortcomings: (i) reliance on simplified use cases, often adapted from examination materials where correctness is easily verifiable, and (ii) the use of ad hoc scenarios that insufficiently capture critical engineering competencies. Consequently, the assessment of LLMs on complex, real-world engineering problems remains largely unexplored. This paper addresses this gap by introducing a curated database of over 100 questions derived from authentic, production-oriented engineering scenarios, systematically designed to cover core competencies such as product design, prognosis, and diagnosis. Using this dataset, we evaluate four state-of-the-art LLMs, including both cloud-based and locally hosted instances, to systematically investigate their performance on complex engineering tasks. Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.