🤖 AI Summary
Existing code embedding models suffer from inadequate fine-grained syntactic and contextual modeling; open-source alternatives (e.g., CodeBERT, UniXcoder) exhibit poor scalability, while proprietary models incur prohibitive computational costs. To address these limitations, we propose a task- and language-decoupled LoRA adapter framework—the first to enable parameter-efficient fine-tuning along both *task type* (Code2Code vs. Text2Code) and *programming language* dimensions, introducing fewer than 2% additional parameters. The framework is trained end-to-end on a 2-million-sample multilingual code corpus and achieves full adapter adaptation within 25 minutes on two H100 GPUs. Experimental results demonstrate substantial improvements: +9.1% MRR in Code2Code retrieval and up to +86.69% accuracy gain in Text2Code tasks. Our approach significantly enhances semantic retrieval precision while drastically improving deployment efficiency and model adaptability across diverse programming languages and downstream tasks.
📝 Abstract
Code embeddings are essential for semantic code search; however, current approaches often struggle to capture the precise syntactic and contextual nuances inherent in code. Open-source models such as CodeBERT and UniXcoder exhibit limitations in scalability and efficiency, while high-performing proprietary systems impose substantial computational costs. We introduce a parameter-efficient fine-tuning method based on Low-Rank Adaptation (LoRA) to construct task-specific adapters for code retrieval. Our approach reduces the number of trainable parameters to less than two percent of the base model, enabling rapid fine-tuning on extensive code corpora (2 million samples in 25 minutes on two H100 GPUs). Experiments demonstrate an increase of up to 9.1% in Mean Reciprocal Rank (MRR) for Code2Code search, and up to 86.69% for Text2Code search tasks across multiple programming languages. Distinction in task-wise and language-wise adaptation helps explore the sensitivity of code retrieval for syntactical and linguistic variations.