Prompt-Driven Code Summarization: A Systematic Literature Review

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the heavy reliance of large language models on prompt design for code summarization tasks and the absence of systematic comparisons and unified evaluation standards across diverse prompting strategies. Through a comprehensive literature review, it integrates and categorizes mainstream approaches—including few-shot prompting, chain-of-thought reasoning, retrieval-augmented generation, and zero-shot learning—and analyzes their effectiveness across different models and scenarios. The work highlights the limitations of current evaluations that overly depend on surface-level overlap metrics, delineates the conditions under which each prompting paradigm performs best, and proposes a unified evaluation framework to guide future research and practical applications in this domain.
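The concern about surface-level overlap metrics can be illustrated with a small sketch. The function below is a simplified unigram F1 score, written here only for illustration; it is not BLEU, ROUGE, or any metric named by the reviewed studies, and the example summaries are invented. It shows how a correct paraphrase can score far below an incorrect summary that happens to share most of its words with the reference.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified overlap score: F1 over whitespace-split, lowercased tokens.

    Hypothetical helper for illustration; real overlap metrics add n-grams,
    brevity penalties, stemming, and other refinements.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example summaries for a function that returns a list's maximum.
reference = "returns the largest value in the list"
paraphrase = "finds the maximum element of an array"    # correct, different words
wrong = "returns the smallest value in the list"        # incorrect, similar words

paraphrase_score = unigram_f1(paraphrase, reference)
wrong_score = unigram_f1(wrong, reference)
```

Under this toy score the semantically wrong summary outranks the correct paraphrase, which is the kind of failure that motivates evaluation beyond token overlap.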

📝 Abstract

Software documentation is essential for program comprehension, developer onboarding, code review, and long-term maintenance. Yet producing quality documentation manually is time-consuming and frequently yields incomplete or inconsistent results. Large language models (LLMs) offer a promising solution by automatically generating natural language descriptions from source code, helping developers understand code more efficiently, facilitating maintenance, and supporting downstream activities such as defect localization and commit message generation. However, the effectiveness of LLMs in documentation tasks critically depends on how they are prompted. Properly structured instructions can substantially improve model performance, making prompt engineering, the design of input prompts to guide model behavior, a foundational technique in LLM-based software engineering. Approaches such as few-shot prompting, chain-of-thought reasoning, retrieval-augmented generation, and zero-shot learning show promise for code summarization, yet current research remains fragmented. There is limited understanding of which prompting strategies work best, for which models, and under what conditions. Moreover, evaluation practices vary widely, with most studies relying on overlap-based metrics that may not capture semantic quality. This systematic literature review consolidates existing evidence, categorizes prompting paradigms, examines their effectiveness, and identifies gaps to guide future research and practical adoption.
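To make the four prompting paradigms named above concrete, here is a minimal sketch of how each might be assembled as a plain-text prompt for a code summarization request. The function names, prompt wording, and the token-overlap retriever are all assumptions made for illustration; they are not templates taken from the reviewed studies, and no model is called.

```python
import re
from typing import List, Tuple

Example = Tuple[str, str]  # (code snippet, human-written summary)


def zero_shot_prompt(code: str) -> str:
    # Instruction only, no demonstrations.
    return ("Summarize the following function in one sentence.\n\n"
            f"Code:\n{code}\n\nSummary:")


def few_shot_prompt(code: str, examples: List[Example]) -> str:
    # Demonstration pairs precede the target snippet.
    parts = ["Summarize each function in one sentence."]
    for ex_code, ex_summary in examples:
        parts.append(f"Code:\n{ex_code}\nSummary: {ex_summary}")
    parts.append(f"Code:\n{code}\nSummary:")
    return "\n\n".join(parts)


def chain_of_thought_prompt(code: str) -> str:
    # Asks for intermediate reasoning before the final summary.
    return ("Explain step by step what the following function does: its "
            "inputs, its outputs, and any side effects. Then give a "
            "one-sentence summary.\n\n"
            f"Code:\n{code}\n\nReasoning:")


def _tokens(text: str) -> set:
    return set(re.findall(r"[A-Za-z_]+", text.lower()))


def retrieve(code: str, corpus: List[Example], k: int = 1) -> List[Example]:
    # Toy retriever (assumption): rank corpus entries by Jaccard
    # similarity of identifier-like tokens with the query snippet.
    query = _tokens(code)

    def score(pair: Example) -> float:
        cand = _tokens(pair[0])
        union = query | cand
        return len(query & cand) / len(union) if union else 0.0

    return sorted(corpus, key=score, reverse=True)[:k]


def retrieval_augmented_prompt(code: str, corpus: List[Example], k: int = 1) -> str:
    # Retrieved neighbours are used as the few-shot demonstrations.
    return few_shot_prompt(code, retrieve(code, corpus, k))


corpus = [
    ("def add(a, b): return a + b", "Adds two numbers."),
    ("def read_file(path): return open(path).read()", "Reads a file."),
]
target = "def add_three(a, b, c): return a + b + c"
rag_prompt = retrieval_augmented_prompt(target, corpus, k=1)
```

In this sketch the retrieval-augmented variant differs from few-shot prompting only in how the demonstrations are chosen: they are selected per query by similarity instead of being fixed in advance.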

Problem

Research questions and friction points this paper is trying to address.

code summarization

large language models

prompt engineering

software documentation

systematic literature review

Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt engineering

code summarization

large language models