Benchmarking Web API Integration Code Generation

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior studies lack dedicated benchmarks and a clear understanding of large language models' (LLMs) capabilities in automatically generating Web API integration code. Method: We introduce the first structured, API-call-oriented evaluation dataset and an end-to-end assessment pipeline that integrates API specification parsing, functional correctness verification, and real-world execution analysis. Experiments span major open-source LLMs across diverse API integration tasks. Contribution/Results: Our evaluation reveals severe deficiencies in endpoint identification, parameter binding, and request construction, with a maximum task success rate below 40%. Models frequently exhibit hallucination, parameter misuse, and protocol violations. This work establishes the first multidimensional, reproducible, quantitative benchmark for assessing LLMs' Web API code generation capability, providing a foundational dataset, rigorous methodology, and empirical evidence to advance research in API-aware intelligent programming.

📝 Abstract
API integration is a cornerstone of our digital infrastructure, enabling software systems to connect and interact. However, as shown by many studies, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. In order to address this, we present a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models were able to solve more than 40% of the tasks.
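To make the error classes concrete, the following is a minimal illustrative sketch (not the paper's actual pipeline) of how a generated API invocation can be checked against an OpenAPI-style specification for the two failure modes named in the abstract: hallucinated endpoints and incorrect argument usage. The spec fragment and function names here are hypothetical.

```python
# Illustrative sketch: validating an LLM-generated web API call against a
# tiny OpenAPI-like spec. SPEC and check_invocation are hypothetical names.

SPEC = {  # (method, path template) -> required parameter names
    ("GET", "/users/{id}"): {"required": {"id"}},
    ("POST", "/users"): {"required": {"name", "email"}},
}

def check_invocation(method: str, path: str, args: dict) -> list:
    """Return a list of error labels for a generated API invocation."""
    errors = []
    key = (method.upper(), path)
    if key not in SPEC:
        # Endpoint does not exist in the spec: a hallucinated endpoint.
        errors.append("hallucinated-endpoint")
        return errors
    # Required parameters that the generated call failed to bind.
    missing = SPEC[key]["required"] - args.keys()
    if missing:
        errors.append("missing-arguments:" + ",".join(sorted(missing)))
    return errors

print(check_invocation("GET", "/users/{id}", {"id": 7}))    # []
print(check_invocation("DELETE", "/orders", {}))            # ['hallucinated-endpoint']
print(check_invocation("POST", "/users", {"name": "Ada"}))  # ['missing-arguments:email']
```

A real pipeline of the kind the paper describes would additionally execute the request and compare responses; this sketch covers only the static spec-conformance step.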
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to generate correct web API integration code
Addressing challenges in automated web API invocation code generation
Assessing code generation errors like hallucinated endpoints and incorrect arguments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset and pipeline for evaluating LLMs
Assessing API invocation code generation ability
Revealing challenges like hallucinated endpoints
Daniel Maninger
Technische Universität Darmstadt, Darmstadt, Germany, Hessian Center for Artificial Intelligence (hessian.AI), Darmstadt, Germany
Leon Chemnitz
Pariton AI, Berlin, Germany
Amir Molzam Sharifloo
Technische Universität Darmstadt, Darmstadt, Germany
Jannis Brugger
Technische Universität Darmstadt, Darmstadt, Germany, Hessian Center for Artificial Intelligence (hessian.AI), Darmstadt, Germany
Mira Mezini
Professor of Computer Science, TU Darmstadt, Germany
Programming Languages, Software Engineering, Program Analysis, Software Security, Reactive Programming