🤖 AI Summary
This paper addresses systematic prompt-induced biases in large language models (LLMs) for code clone detection. We propose the first prompt bias taxonomy and mitigation framework specifically designed for clone detection. We systematically identify eight representative bias categories, reinterpret model misclassifications as corrective bias signals, and introduce an F1-guided prompt rewriting strategy for bias-driven prompt optimization. Evaluated with generative LLMs, including PaLM, our approach achieves F1 scores of 89.30 on the Avatar benchmark and 86.41 on the PoolC benchmark, an improvement of up to 10.81% over baselines. Our core contributions are threefold: (1) a principled, interpretable taxonomy of prompt biases; (2) a bias-aware prompt engineering paradigm grounded in empirical error analysis; and (3) rigorous validation of its effectiveness and generalizability across realistic clone detection tasks and diverse model architectures.
📝 Abstract
Code clones have been a persistent issue in software engineering, largely because developers frequently copy and paste code segments. This common practice has elevated the importance of clone detection, drawing attention from both software engineering researchers and industry practitioners, whose shared concern stems from the negative impact that code clones can have on software quality. The emergence of powerful generative Large Language Models (LLMs) such as ChatGPT has exacerbated the problem: the code generation capabilities of these models can inadvertently produce code clones. As a result, detecting code clones has become more critical than ever. In this study, we assess the suitability of LLMs for code clone detection. Our results show that the PaLM model achieved high F1 scores of 89.30 on the Avatar dataset and 86.41 on the PoolC dataset. A known issue with LLMs is their susceptibility to prompt bias, where model performance fluctuates depending on the input prompt. In this work, we investigate the reasons behind these fluctuations and propose a framework to mitigate prompt bias in clone detection. Our analysis identifies eight distinct categories of prompt bias, and our approach, which leverages these biases, yields an improvement of up to 10.81% in F1 score. These findings underscore the substantial impact of prompt bias on LLM performance and highlight the potential of leveraging model errors to alleviate this bias.
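The paper does not include pseudocode for the F1-guided prompt rewriting strategy; the sketch below is an illustrative reconstruction of the general idea, not the authors' implementation. It evaluates each candidate prompt on labelled clone pairs, collects misclassified pairs as bias signals, and greedily keeps the prompt with the best F1. The `classify` callable stands in for an actual LLM query; all names here are assumptions.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall over clone-pair predictions."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def evaluate_prompt(prompt, pairs, labels, classify):
    """Score one prompt on labelled pairs; return F1 and the
    misclassified pairs, which serve as corrective bias signals."""
    tp = fp = fn = 0
    errors = []
    for pair, label in zip(pairs, labels):
        pred = classify(prompt, pair)  # stand-in for an LLM call
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
            errors.append(pair)
        elif not pred and label:
            fn += 1
            errors.append(pair)
    return f1_score(tp, fp, fn), errors

def select_prompt(candidates, pairs, labels, classify):
    """Greedy F1-guided selection: keep the candidate rewrite whose
    predictions achieve the highest F1 on the labelled pairs."""
    best_prompt, best_f1 = None, -1.0
    for prompt in candidates:
        f1, _ = evaluate_prompt(prompt, pairs, labels, classify)
        if f1 > best_f1:
            best_prompt, best_f1 = prompt, f1
    return best_prompt, best_f1
```

In the paper's full framework, the error set returned by `evaluate_prompt` would additionally be mapped to one of the eight bias categories to guide how the next candidate prompt is rewritten; this sketch only shows the F1-guided selection loop.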