Caruca: Effective and Efficient Specification Mining for Opaque Software Components

📅 2025-10-16
🤖 AI Summary
Opaque components such as Unix shell commands lack the partial specifications that program-analysis systems need, and writing those specifications by hand is laborious and error-prone. Method: Caruca uses a large language model (LLM) to translate a command's user-facing documentation into a structured invocation syntax. It then explores the space of syntactically valid invocations and execution environments, and concretely executes each command-environment pair while interposing at the system-call and filesystem level to extract properties such as parallelizability and filesystem pre- and post-conditions. Contribution/Results: The mined properties can be exported in multiple specification formats and used directly by existing systems. Applied to 60 GNU Coreutils, POSIX, and third-party commands across several specification-dependent systems, Caruca generates correct specifications for all but one case, eliminating manual specification authoring and supplying the full specifications for a state-of-the-art static analysis tool.

📝 Abstract
A wealth of state-of-the-art systems demonstrate impressive improvements in performance, security, and reliability on programs composed of opaque components, such as Unix shell commands. To reason about commands, these systems require partial specifications. However, creating such specifications is a manual, laborious, and error-prone process, limiting the practicality of these systems. This paper presents Caruca, a system for automatic specification mining for opaque commands. To overcome the challenge of language diversity across commands, Caruca first instruments a large language model to translate a command's user-facing documentation into a structured invocation syntax. Using this representation, Caruca explores the space of syntactically valid command invocations and execution environments. Caruca concretely executes each command-environment pair, interposing at the system-call and filesystem level to extract key command properties such as parallelizability and filesystem pre- and post-conditions. These properties can be exported in multiple specification formats and are immediately usable by existing systems. Applying Caruca across 60 GNU Coreutils, POSIX, and third-party commands across several specification-dependent systems shows that Caruca generates correct specifications for all but one case, completely eliminating manual effort from the process and currently powering the full specifications for a state-of-the-art static analysis tool.
Problem

Research questions and friction points this paper is trying to address.

Automates specification mining for opaque software components like shell commands
Translates command documentation into structured syntax using large language models
Extracts command properties through concrete execution and system-call interposition
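To make the idea of a structured invocation syntax concrete, the sketch below models a command's flags and operands as data and enumerates a bounded set of syntactically valid invocations from them. It is an illustration under stated assumptions, not Caruca's representation: the `Flag` and `Syntax` classes, the `enumerate_invocations` function, the `max_flags` bound, and the hand-written `rm` entry are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from itertools import combinations, product


@dataclass(frozen=True)
class Flag:
    """One option of a command (hypothetical model, not Caruca's)."""
    name: str                   # e.g. "-r"
    takes_value: bool = False   # True if the flag needs an argument
    sample_values: tuple = ()   # candidate arguments to try


@dataclass(frozen=True)
class Syntax:
    """A simplified structured invocation syntax for one command."""
    command: str
    flags: tuple
    operand_choices: tuple      # each entry is one tuple of operands


def enumerate_invocations(syntax, max_flags=2):
    """List argv vectors using at most `max_flags` distinct flags."""
    invocations = []
    for k in range(max_flags + 1):
        for subset in combinations(syntax.flags, k):
            # Each flag contributes one argv fragment per sample value,
            # or a single fragment if it takes no value.
            fragments = [
                [(f.name, v) for v in f.sample_values]
                if f.takes_value else [(f.name,)]
                for f in subset
            ]
            for choice in product(*fragments):
                flag_args = [tok for frag in choice for tok in frag]
                for operands in syntax.operand_choices:
                    invocations.append(
                        [syntax.command, *flag_args, *operands]
                    )
    return invocations


# Hand-written example entry; in the paper this structure would come
# from the LLM's translation of the command's documentation.
RM_SYNTAX = Syntax(
    command="rm",
    flags=(Flag("-f"), Flag("-r")),
    operand_choices=(("file.txt",), ("dir",)),
)

if __name__ == "__main__":
    for argv in enumerate_invocations(RM_SYNTAX):
        print(" ".join(argv))
```

Bounding the number of flags per invocation keeps the enumeration finite; how the actual system prunes or samples this space is not stated in the abstract.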
Innovation

Methods, ideas, or system contributions that make the work stand out.

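Uses a large language model to translate a command's user-facing documentation into a structured invocation syntax
Explores the space of syntactically valid command invocations and execution environments
Extracts properties such as parallelizability and filesystem pre- and post-conditions by interposing at the system-call and filesystem level during concrete execution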
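The execution step can be loosely approximated without any system-call tracing. The sketch below runs one invocation in a scratch directory and compares directory snapshots taken before and after. This snapshot diff is a swapped-in simplification, not the interposition mechanism the paper describes, and the names `snapshot`, `observe`, and `make_file` are hypothetical helpers introduced for illustration.

```python
import os
import subprocess
import tempfile


def snapshot(root):
    """Map each path under `root` (relative) to 'dir' or its byte size."""
    state = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            state[rel] = "dir"
        for name in filenames:
            path = os.path.join(dirpath, name)
            # lstat avoids following (possibly dangling) symlinks.
            state[os.path.relpath(path, root)] = os.lstat(path).st_size
    return state


def observe(argv, setup):
    """Run `argv` in a fresh directory prepared by `setup`; report effects."""
    with tempfile.TemporaryDirectory() as root:
        setup(root)                      # build the execution environment
        before = snapshot(root)          # observed pre-state
        result = subprocess.run(
            argv, cwd=root, capture_output=True, text=True
        )
        after = snapshot(root)           # observed post-state
    return {
        "exit_code": result.returncode,
        "created": sorted(set(after) - set(before)),
        "deleted": sorted(set(before) - set(after)),
        # Size-only comparison: same-size rewrites go undetected.
        "modified": sorted(
            p for p in before.keys() & after.keys()
            if before[p] != after[p]
        ),
    }


def make_file(root):
    """Example environment: a single small file named file.txt."""
    with open(os.path.join(root, "file.txt"), "w") as f:
        f.write("hello\n")


if __name__ == "__main__":
    # Assumes a Unix-like system where `rm` is on PATH.
    print(observe(["rm", "file.txt"], make_file))
```

A black-box diff like this only sees net effects on one directory. Interposing at the system-call level, as the paper describes, would also reveal reads, accesses outside the scratch directory, and operations that cancel each other out.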
Authors

Evangelos Lamprou
Brown University
Seong-Heon Jung
New York University
Mayank Keoliya
University of Pennsylvania
Lukas Lazarek
Brown University
Konstantinos Kallas
Assistant Professor, UCLA
Computer Systems, Compilers, Programming Languages, Formal Methods
Michael Greenberg
Stevens Institute of Technology
Nikos Vasilakis
Brown University