🤖 AI Summary
There is currently no systematic taxonomy of code smells for LLM inference code, which hinders high-quality integration of LLMs into software systems.
Method: This paper introduces the novel concept of "LLM code smells," formally defining five representative categories (including prompt hardcoding, unvalidated response handling, and context leakage) and establishing the first structured, inference-phase-specific classification catalog. We extend the SpecDetect4AI toolchain with static analysis and rule-based pattern matching to enable automated detection.
Contribution/Results: Evaluated on 200 open-source LLM applications, our approach detects LLM code smells in 60.50% of the analyzed systems, with an average detection precision of 86.06%. This work bridges a critical gap in the quality assurance of LLM engineering practice, providing both a theoretical foundation and a practical, automated detection capability to support secure, maintainable, and production-ready LLM integration.
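For concreteness, here is a minimal Python sketch (our illustration, not code from the paper) of what two of these smells can look like in practice; the function and its invoice scenario are invented for this example:

```python
import json
from openai import OpenAI

client = OpenAI()

def extract_invoice_total(invoice_text: str) -> float:
    # Prompt hardcoding: the prompt lives as an inline string literal,
    # entangling prompt engineering with business logic and making the
    # prompt hard to version, test, or update independently.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Return ONLY a JSON object {\"total\": <number>} "
                       "for this invoice:\n" + invoice_text,
        }],
    )
    # Unvalidated response handling: the completion is parsed and indexed
    # with no schema check or error handling, so any malformed model
    # output raises an unhandled exception at runtime.
    return json.loads(response.choices[0].message.content)["total"]
```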
📄 Abstract
Large Language Models (LLMs) have gained massive popularity in recent years and are increasingly integrated into software systems for diverse purposes. However, integrating them poorly into source code may undermine software system quality. Yet, to our knowledge, there is no formal catalog of code smells specific to coding practices for LLM inference. In this paper, we introduce the concept of LLM code smells and formalize five recurrent problematic coding practices related to LLM inference in software systems, based on relevant literature. We extend the detection tool SpecDetect4AI to cover the newly defined LLM code smells and use it to validate their prevalence in a dataset of 200 open-source LLM systems. Our results show that LLM code smells affect 60.50% of the analyzed systems, with a detection precision of 86.06%.
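To illustrate the detection side, the following is a minimal sketch of a rule-based static check in the spirit of the approach, built on Python's ast module. SpecDetect4AI's actual rules are not shown in the abstract; the class name HardcodedPromptRule and its matching heuristic are assumptions made for illustration:

```python
import ast

class HardcodedPromptRule(ast.NodeVisitor):
    """Flags calls whose `messages` argument carries an inline prompt string."""

    def __init__(self) -> None:
        self.findings: list[int] = []

    def visit_Call(self, node: ast.Call) -> None:
        for kw in node.keywords:
            if kw.arg == "messages":
                # Heuristic: a message dict whose "content" value is a string
                # literal, a concatenation, or an f-string is treated as a
                # hardcoded prompt rather than an externalized template.
                for d in ast.walk(kw.value):
                    if isinstance(d, ast.Dict):
                        for key, value in zip(d.keys, d.values):
                            if (isinstance(key, ast.Constant)
                                    and key.value == "content"
                                    and isinstance(value, (ast.Constant,
                                                           ast.BinOp,
                                                           ast.JoinedStr))):
                                self.findings.append(node.lineno)
        self.generic_visit(node)

# Example usage on a single source file.
with open("app.py") as f:
    tree = ast.parse(f.read())
rule = HardcodedPromptRule()
rule.visit(tree)
print("possible hardcoded prompts at lines:", rule.findings)
```

A real detector would combine several such rules and tune them against false positives, which is where the reported 86.06% precision figure becomes the relevant quality measure.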