How Effective are Large Language Models in Generating Software Specifications?

📅 2023-06-06
📈 Citations: 17
Influential: 1
🤖 AI Summary
This work addresses the underexplored challenge of automatically generating formal specifications—expressed in first-order logic—from software comments and documentation using large language models (LLMs). Method: We conduct the first systematic evaluation of 13 state-of-the-art LLMs (e.g., Codex, Llama, PaLM) against traditional approaches on three public benchmarks under few-shot settings. We introduce a cross-model failure diagnosis framework, establish a reproducible evaluation benchmark, and propose a taxonomy of failure modes. Contribution/Results: Experiments reveal that certain LLMs achieve performance comparable to or exceeding traditional tools in specific scenarios; however, semantic abstraction, context sensitivity, and logical rigor remain critical bottlenecks. Our analysis uncovers complementary strengths between LLMs and classical methods, providing empirical foundations and concrete directions for advancing LLM-augmented formal methods. The benchmark, taxonomy, and diagnostic framework are publicly released to support reproducible research.
📝 Abstract
Software specifications are essential for many Software Engineering (SE) tasks such as bug detection and test generation. Many existing approaches have been proposed to extract specifications defined in natural-language form (e.g., comments) into a formal, machine-readable form (e.g., first-order logic). However, existing approaches suffer from limited generalizability and require manual effort. The recent emergence of Large Language Models (LLMs), which have been successfully applied to numerous SE tasks, offers a promising avenue for automating this process. In this paper, we conduct the first empirical study to evaluate the capabilities of LLMs for generating software specifications from software comments or documentation. We evaluate LLMs' performance with Few-Shot Learning (FSL) and compare the performance of 13 state-of-the-art LLMs with traditional approaches on three public datasets. In addition, we conduct a comparative diagnosis of the failure cases from both LLMs and traditional methods, identifying their unique strengths and weaknesses. Our study offers valuable insights for future research to improve specification generation.
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs in generating software specifications
Compare LLMs with traditional specification extraction methods
Diagnose failures in LLMs and traditional methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Few Shot Learning
Software Specification Generation
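The Few-Shot Learning setup the paper evaluates can be sketched as a prompt that concatenates a handful of (comment → formal specification) demonstrations before the query comment. The example pairs, spec syntax, and function names below are illustrative assumptions for the sketch, not taken from the paper's benchmarks:

```python
# Minimal sketch of a few-shot (FSL) prompt for comment -> formal-spec
# translation. The demonstration pairs and spec notation are invented
# for illustration; the paper's actual benchmarks and formats differ.

FEW_SHOT_EXAMPLES = [
    ("Returns null if the key is not found.",
     "key not in map.keys() -> return == null"),
    ("Throws IllegalArgumentException if size is negative.",
     "size < 0 -> throws(IllegalArgumentException)"),
]

def build_fsl_prompt(comment: str) -> str:
    """Assemble a few-shot prompt mapping a comment to a formal spec."""
    parts = ["Translate each comment into a formal specification.\n"]
    for src, spec in FEW_SHOT_EXAMPLES:
        parts.append(f"Comment: {src}\nSpecification: {spec}\n")
    # The query comment comes last; the model completes the spec.
    parts.append(f"Comment: {comment}\nSpecification:")
    return "\n".join(parts)

prompt = build_fsl_prompt("Returns -1 if the element does not occur.")
print(prompt)
```

The assembled prompt would then be sent to an LLM; varying the number of demonstrations is the usual knob in such few-shot evaluations.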
Danning Xie
Purdue University
software engineering
B. Yoo
Computer Science and Engineering Department, UNIST, South Korea
Nan Jiang
Computer Science Department, Purdue University, USA
Mijung Kim
UNIST, South Korea
Software engineering · Software testing and analysis
Lin Tan
Mary J. Elmore New Frontiers Professor, Computer Science, Purdue University
LLM4Code · Software reliability · AI · Text analytics · Autoformalization
X. Zhang
Computer Science Department, Purdue University, USA
Judy S. Lee
IBM Chief Analytics Office, Armonk, NY, USA