🤖 AI Summary
To address inconsistent manual logging quality, privacy concerns, and the high computational overhead of large language models (LLMs), this work systematically investigates the feasibility of small open-source language models (SOLMs, 1B–14B parameters) for high-quality log statement generation. We propose a synergistic fine-tuning framework integrating retrieval-augmented generation (RAG) and low-rank adaptation (LoRA), and are the first to empirically validate SOLMs' competitiveness in log generation. Our instruction-tuned Qwen2.5-Coder-14B model outperforms existing tools and LLM baselines on both log-location prediction and log-statement generation, achieving state-of-the-art (SOTA) results under both conventional metrics and LLM-as-a-judge evaluation. Cross-repository generalization tests further demonstrate strong transferability. The core contribution is establishing SOLMs as a new paradigm for log generation, one that balances performance, data privacy, and inference efficiency.
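The low-rank adaptation (LoRA) component mentioned above can be illustrated with a minimal sketch of the underlying math: the frozen pretrained weight `W` is augmented with a trainable low-rank update `(alpha / r) * B @ A`. All dimensions, the scaling convention, and the zero-initialization of `B` below follow the common LoRA recipe, not this paper's exact configuration.

```python
import numpy as np

# Sketch of the LoRA idea: instead of updating a full weight matrix
# W (d_out x d_in), train only two small matrices A (r x d_in) and
# B (d_out x r) with rank r << min(d_out, d_in).
# The effective weight is W + (alpha / r) * B @ A.

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 128, 8, 16

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init: no change at start

def lora_forward(x):
    # x: (d_in,) activation; frozen path plus scaled low-rank path
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapted output equals the frozen output.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameter count drops from d_out*d_in to r*(d_in + d_out).
full_params, lora_params = d_out * d_in, r * (d_in + d_out)
print(full_params, lora_params)  # 8192 1536
```

The parameter-count arithmetic at the end shows why PEFT keeps fine-tuning cheap on a 14B-parameter SOLM: only the small `A`/`B` pairs are updated while the base weights stay frozen.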
📝 Abstract
Effective software maintenance relies heavily on high-quality logging statements, but manual logging is challenging, error-prone, and insufficiently standardized, often leading to inconsistent log quality. While large language models (LLMs) have shown promise in automatic logging, they raise concerns regarding privacy, resource intensity, and adaptability to specific enterprise needs. To tackle these limitations, this paper empirically investigates whether Small Open-source Language Models (SOLMs) can become a viable alternative when properly exploited. Specifically, we conduct a large-scale empirical study of four prominent SOLMs, systematically evaluating the impact of interaction strategies, parameter-efficient fine-tuning (PEFT) techniques, model sizes, and model types on automatic logging. Our key findings reveal that Retrieval-Augmented Generation (RAG) significantly enhances performance and that LoRA is a highly effective PEFT technique. While larger SOLMs tend to perform better, this involves a trade-off with computational resources, and instruction-tuned SOLMs generally surpass their base counterparts. Notably, fine-tuned SOLMs, particularly Qwen2.5-Coder-14B, outperformed existing specialized tools and LLM baselines in accurately predicting logging locations and generating high-quality statements, a conclusion supported by both traditional evaluation metrics and LLM-as-a-judge evaluations. Furthermore, SOLMs demonstrated robust generalization across diverse, unseen code repositories.
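The retrieval-augmented interaction strategy the abstract credits with the largest gains can be sketched as follows: retrieve the (method, logging statement) pairs most similar to the target method and prepend them as few-shot examples before querying the model. The toy corpus, the token-level Jaccard similarity, and the prompt template below are illustrative assumptions, not the paper's actual retriever or prompt format.

```python
# Hedged sketch of RAG-style prompt construction for log generation.
# Similarity here is simple token Jaccard overlap; a real system might
# use BM25 or dense embeddings instead.

def tokens(code):
    return set(code.replace("(", " ").replace(")", " ").replace(".", " ").split())

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve(query, corpus, k=2):
    # corpus: list of (method_code, log_statement) pairs
    return sorted(corpus, key=lambda pair: jaccard(query, pair[0]), reverse=True)[:k]

def build_prompt(query, corpus, k=2):
    shots = retrieve(query, corpus, k)
    demos = "\n\n".join(f"Method:\n{m}\nLog statement:\n{s}" for m, s in shots)
    return f"{demos}\n\nMethod:\n{query}\nLog statement:\n"

# Hypothetical retrieval corpus of previously logged methods.
corpus = [
    ("void connect(String host)", 'logger.info("Connecting to {}", host);'),
    ("int parseConfig(File f)", 'logger.debug("Parsing config {}", f);'),
    ("void disconnect()", 'logger.info("Disconnected");'),
]
prompt = build_prompt("void connectRetry(String host)", corpus, k=1)
```

The resulting `prompt` ends with the target method and an empty "Log statement:" slot, so the SOLM completes it in the style of the retrieved example; this in-context grounding is what lets a small model imitate project-specific logging conventions.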