🤖 AI Summary
This study systematically investigates the accuracy–energy trade-off of locally deployed AI coding assistants in software development. We evaluate 18 large language models, including Llama-3 and CodeLlama, on consumer-grade and AI-optimized GPUs for typical tasks such as code completion, in both full-precision (FP16) and quantized (INT8, INT4) configurations. Using the Hugging Face Transformers framework, we measure CodeBLEU, execution accuracy, latency, and empirical GPU power consumption (via NVIDIA DCGM). Our analysis reveals, for the first time, a nonlinear relationship between model accuracy and energy consumption. Notably, quantized large models can beat smaller full-precision ones on both axes: INT4 Llama-3-70B achieves 42% lower energy consumption and 11% higher accuracy than full-precision CodeLlama-13B on code completion. Crucially, no universally optimal model exists: cross-task performance variance reaches 37%, necessitating task-specific model selection.
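As an illustration of the setup described above, the sketch below loads an INT4-quantized model through Hugging Face Transformers with bitsandbytes and runs a code-completion prompt. This is a minimal sketch, not the paper's pipeline: the model id, quantization parameters, prompt, and generation settings are illustrative assumptions.

```python
# Minimal sketch: INT4-quantized code completion with Hugging Face
# Transformers + bitsandbytes. Model id and prompt are assumptions;
# the paper's exact configuration is not shown in this summary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B"  # illustrative choice of causal LM

# 4-bit quantization: weights stored as NF4, computation done in FP16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # shard/offload across available GPUs
)

# A typical code-completion prompt: complete a function body.
prompt = "def quicksort(arr):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```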
📝 Abstract
The use of generative AI-based coding assistants like ChatGPT and GitHub Copilot is a reality in contemporary software development. Many of these tools are provided as remote APIs. Using third-party APIs raises data privacy and security concerns for client companies, which motivates the use of locally deployed language models. In this study, we explore the trade-off between model accuracy and energy consumption, aiming to provide valuable insights that help developers make informed decisions when selecting a language model. We investigate the performance of 18 families of LLMs on typical software development tasks on two real-world infrastructures: a commodity GPU and a powerful AI-specific GPU. Given that deploying LLMs locally requires powerful infrastructure that might not be affordable for everyone, we consider both full-precision and quantized models. Our findings reveal that employing a big LLM with a higher energy budget does not always translate into significantly improved accuracy. Additionally, quantized versions of large models generally offer better efficiency and accuracy than full-precision versions of medium-sized ones. Finally, no single model is suitable for all types of software development tasks.
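To make the accuracy–energy trade-off concrete, the sketch below shows one way to estimate the energy of a single inference: sample instantaneous GPU power while the model generates, then integrate power over time. Note the study measures power with NVIDIA DCGM; the NVML bindings (pynvml) used here are a simpler stand-in, and the function name and sampling interval are illustrative assumptions.

```python
# Sketch: estimate GPU energy for one inference by sampling instantaneous
# power and integrating over time. The paper uses NVIDIA DCGM; pynvml is
# used here as an illustrative stand-in, not the authors' tooling.
import time
import threading
import pynvml

def measure_energy_joules(run_inference, device_index=0, interval_s=0.05):
    """Run `run_inference()` while sampling GPU power; return (result, joules)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []  # (timestamp in seconds, power in watts)
    done = threading.Event()

    def sampler():
        while not done.is_set():
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
            samples.append((time.monotonic(), watts))
            time.sleep(interval_s)

    t = threading.Thread(target=sampler)
    t.start()
    try:
        result = run_inference()
    finally:
        done.set()
        t.join()
        pynvml.nvmlShutdown()

    # Trapezoidal integration of power over time yields energy in joules.
    joules = sum(
        (t2 - t1) * (p1 + p2) / 2.0
        for (t1, p1), (t2, p2) in zip(samples, samples[1:])
    )
    return result, joules
```

Dividing the resulting joules by the number of generated tokens gives an energy-per-token figure, which is one common way to compare models of different sizes on an equal footing.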