Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models struggle to coordinate several kinds of knowledge (function signatures, module paths, input-output contracts, and semantics) when they encounter novel APIs absent from their training data. To address this, the work proposes NovelAPIBench, a fully automated dynamic benchmark that mines novel APIs, decomposes them into knowledge units, generates executable tasks, and attributes each failure to a diagnostic category. Experiments show that usage examples are the strongest single knowledge signal, that adding source code context can increase import-path errors, and that fine-tuning mainly improves the model's ability to use externally provided knowledge, an ability that transfers to held-out libraries. The study concludes that retrieval and parametric adaptation are complementary rather than interchangeable for effective use of novel APIs.
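The import-error finding can be made concrete with a small sketch of failure attribution. This is an illustration only, not the benchmark's harness: the summary does not name the six diagnostic categories, so the three labels below (`pass`, `import_path_error`, `other_failure`) and the function `attribute_failure` are assumptions introduced here.

```python
def attribute_failure(candidate_code: str) -> str:
    """Run a candidate solution and return a coarse outcome label.

    Assumption: the labels are hypothetical stand-ins, not the paper's
    six diagnostic categories.
    """
    try:
        # Compile and execute in a fresh namespace. A real harness would
        # sandbox this step; running untrusted code in-process is unsafe.
        compiled = compile(candidate_code, "<candidate>", "exec")
        exec(compiled, {"__name__": "__candidate__"})
    except ImportError:
        # Covers both a wrong module path (ModuleNotFoundError is a
        # subclass of ImportError) and a wrong name inside a real module.
        return "import_path_error"
    except Exception:
        # Syntax errors, failed assertions, and any other runtime error.
        return "other_failure"
    return "pass"


if __name__ == "__main__":
    samples = [
        "import math\nassert math.sqrt(4) == 2",
        "from math import not_a_real_name_xyz",
        "assert 1 == 2",
    ]
    for code in samples:
        print(attribute_failure(code))
```

Separating import failures from everything else is what lets an evaluation notice that extra context raises one specific error type even when the overall pass rate barely moves.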

📝 Abstract

Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordinate signatures, module paths, input-output contracts, semantics, and executable usage patterns. Existing novel-API benchmarks are typically static, rely on coarse pass/fail metrics, or use synthetic APIs that may not reflect real library evolution. We introduce NovelAPIBench, a fully automated dynamic benchmark that, for any base model and target library, discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns failed samples to six diagnostic categories. Across about 1.9K tasks, four base models, and five domains, we compare knowledge injected through retrieval with knowledge internalized through parametric adaptation. We find that knowledge components are not interchangeable: usage examples are the strongest standalone signal, while the best two-component setting pairs signatures with either mechanisms or examples depending on the domain and backbone. Adding more context, especially source code, can hurt by increasing import-path errors. Parametric adaptation also does not replace retrieval once external knowledge is removed; rather, fine-tuning mainly teaches models how to use provided bundles, and this ability transfers to held-out libraries. These results suggest that retrieval and tuning play complementary roles: retrieval supplies volatile API content, while tuning improves procedural integration.
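The idea of decomposed knowledge bundles and component-wise comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's schema: the class `KnowledgeBundle`, its five field names (taken loosely from the knowledge types listed in the abstract), the helper functions, and the example library `fastgrid` are all hypothetical.

```python
from dataclasses import dataclass, fields
from itertools import combinations
from typing import Iterable


@dataclass(frozen=True)
class KnowledgeBundle:
    """Hypothetical container for decomposed knowledge about one API."""
    signature: str
    module_path: str
    io_contract: str
    semantics: str
    usage_example: str


def component_names() -> list[str]:
    """Names of all bundle components, in declaration order."""
    return [f.name for f in fields(KnowledgeBundle)]


def build_prompt(task: str, bundle: KnowledgeBundle, include: Iterable[str]) -> str:
    """Assemble a prompt that exposes only the selected components."""
    chosen = list(include)
    unknown = set(chosen) - set(component_names())
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    sections = [f"# Task\n{task}"]
    for name in chosen:
        sections.append(f"# {name}\n{getattr(bundle, name)}")
    return "\n\n".join(sections)


def ablation_settings(k: int) -> list[tuple[str, ...]]:
    """All ways to expose exactly k components to the model."""
    return list(combinations(component_names(), k))


if __name__ == "__main__":
    # 'fastgrid.tile' is an invented API used only for illustration.
    bundle = KnowledgeBundle(
        signature="tile(data, rows, cols)",
        module_path="fastgrid.layout",
        io_contract="list of items in, list of rows out",
        semantics="splits data into a rows-by-cols grid",
        usage_example="from fastgrid.layout import tile\ntile([1, 2, 3, 4], 2, 2)",
    )
    print(build_prompt("Arrange four items in a 2x2 grid.", bundle, ["signature", "usage_example"]))
    print(len(ablation_settings(2)), "two-component settings")
```

Enumerating the subsets explicitly mirrors the comparison the abstract describes: each single-component or two-component setting yields a different prompt for the same task, so differences in pass rate can be traced to which knowledge was supplied.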

Problem

Research questions and friction points this paper is trying to address.

novel API

knowledge gaps

tool use

LLM benchmarking

API acquisition

Innovation

Methods, ideas, or system contributions that make the work stand out.

NovelAPIBench

dynamic benchmarking

API knowledge decomposition