🤖 AI Summary
This study addresses the automatic generation of structured test cases from natural language requirements, a task often hindered by requirement ambiguity and poor generation quality. The authors systematically evaluate LoRA-based parameter-efficient fine-tuning across a range of open-source and proprietary large language models, using hyperparameter optimization and a unified experimental protocol. They also introduce a GPT-4o-based automated evaluation framework that scores generated test cases along nine quality dimensions. Experimental results show that the LoRA-fine-tuned Ministral-8B model performs best among the open-source models, approaching the pre-fine-tuned GPT-4.1 models and substantially narrowing the performance gap between open-source and proprietary models. These findings support well-tuned open-source models as practical alternatives in this domain.
📝 Abstract
Automated test case generation from natural language requirements remains a challenging problem in software engineering due to the ambiguity of requirements and the need to produce structured, executable test artifacts. Recent advances in LLMs have shown promise in addressing this task; however, their effectiveness depends on task-specific adaptation and efficient fine-tuning strategies. In this paper, we present a comprehensive empirical study on the use of parameter-efficient fine-tuning, specifically LoRA, for requirement-based test case generation. We evaluate multiple LLM families, including open-source and proprietary models, under a unified experimental pipeline. The study systematically explores the impact of key LoRA hyperparameters, including rank, scaling factor, and dropout, on downstream performance. We propose an automated evaluation framework based on GPT-4o, which assesses generated test cases across nine quality dimensions. Experimental results demonstrate that LoRA-based fine-tuning significantly improves the performance of all open-source models, with Ministral-8B achieving the best results among them. Furthermore, we show that a fine-tuned 8B open-source model can achieve performance comparable to pre-fine-tuned GPT-4.1 models, highlighting the effectiveness of parameter-efficient adaptation. While GPT-4.1 models achieve the highest overall performance, the performance gap between proprietary and open-source models is substantially reduced after fine-tuning. These findings provide important insights into model selection, fine-tuning strategies, and evaluation methods for automated test generation. In particular, they demonstrate that cost-efficient, locally deployable open-source models can serve as viable alternatives to proprietary systems when combined with well-designed fine-tuning approaches.
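As a rough illustration of the three LoRA hyperparameters named in the abstract (rank, scaling factor, dropout), the sketch below follows the standard LoRA formulation, y = W x + (alpha / r) · B (A · dropout(x)), in plain Python. It is not the authors' pipeline: the function names, the matrix shapes, and the 4096-dimensional layer in the usage note are assumptions chosen for illustration only.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, dropout=0.0, training=False, rng=None):
    """Standard LoRA forward pass: y = W x + (alpha / r) * B (A dropout(x)).

    W is the frozen weight (d_out x d_in); A (r x d_in) and B (d_out x r)
    are the trainable low-rank factors. The rank r is the number of rows in A.
    """
    r = len(A)
    x_in = x
    if training and dropout > 0.0:
        # Inverted dropout on the input of the low-rank branch only.
        rng = rng or random.Random(0)
        keep = 1.0 - dropout
        x_in = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    base = matvec(W, x)                  # frozen path
    delta = matvec(B, matvec(A, x_in))   # low-rank update path
    scale = alpha / r                    # scaling factor
    return [b + scale * d for b, d in zip(base, delta)]


def lora_trainable_params(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter: r * (d_in + d_out)."""
    return r * (d_in + d_out)
```

For a hypothetical 4096 × 4096 projection, `lora_trainable_params(4096, 4096, 8)` gives 65,536 trainable parameters against 16,777,216 in the full matrix, which is why rank is the main lever on adapter size. Because B is conventionally initialized to zero, the adapted layer reproduces the frozen layer exactly at the start of training.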