🤖 AI Summary
Problem: There is no systematic, domain-specific benchmark for evaluating large language models (LLMs) on generating executable ns-3 network simulation code from natural language specifications.
Method: We introduce SIMCODE—the first dedicated benchmark for natural-language-to-ns-3-code translation—comprising 400 graded tasks with fully executable test cases. It establishes a novel three-tiered evaluation framework assessing semantic correctness, syntactic validity, and runtime execution success, enabling rigorous comparison of prompt engineering and task-specific fine-tuning strategies.
Contribution/Results: We evaluate three LLMs (Gemini-2.0, GPT-4.1, and Qwen-3) across six prompting techniques. GPT-4.1 achieves the highest overall performance, yet execution accuracy remains constrained by recurring errors, including missing header files and API mismatches. SIMCODE fills a critical gap in the evaluation of domain-specific simulation code generation, providing reproducible, extensible infrastructure to advance trustworthy LLM deployment in systems-level network modeling.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in code generation across various domains. However, their effectiveness at generating simulation scripts for domain-specific environments such as ns-3 remains underexplored. Despite growing interest in automating network simulations, existing tools prioritize interactive automation over rigorous evaluation. To enable systematic evaluation, we introduce SIMCODE, the first benchmark for assessing LLMs' ability to generate ns-3 simulation code from natural language. SIMCODE comprises 400 tasks across introductory, intermediate, and advanced levels, each with a reference solution and test cases. Using SIMCODE, we evaluate three prominent LLMs, Gemini-2.0, GPT-4.1, and Qwen-3, across six prompting techniques, and we further investigate the impact of task-specific fine-tuning. While GPT-4.1 outperforms the others, execution accuracy remains modest, with substantial room for improvement. Error analysis identifies missing header files and API mismatches as the dominant failure modes. SIMCODE thus provides a foundational step toward rigorous evaluation of LLMs and further research in domain-aware generative systems.