Efficient Code Embeddings from Code Generation Models

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three core tasks in code intelligence: code retrieval, technical question answering, and cross-lingual semantic matching. The authors propose a lightweight, efficient, small-scale code embedding model. Methodologically, they build on a text-code joint autoregressive pre-trained language model, departing from conventional contrastive-only or dual-encoder paradigms; code representations are derived via last-token pooling over the generative model's output sequence and then refined through supervised fine-tuning to shape the embedding space. This design sharply reduces computational overhead without increasing parameter count. Empirically, the model achieves state-of-the-art performance on CodeSearchNet benchmarks across all three tasks—code retrieval, cross-lingual similarity identification, and technical QA—demonstrating both the effectiveness and scalability of generative architectures for code representation learning.
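The last-token pooling step described above can be sketched as follows. This is an illustrative assumption of how such pooling is typically done, not the paper's actual implementation; the `last_token_pool` helper and the toy array shapes are made up for the example:

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the hidden state of the last non-padding token of each sequence.

    hidden_states: (batch, seq_len, dim) final-layer activations.
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    """
    # Index of the last real token in each sequence.
    last_idx = attention_mask.sum(axis=1) - 1          # shape (batch,)
    batch_idx = np.arange(hidden_states.shape[0])
    embeddings = hidden_states[batch_idx, last_idx]    # shape (batch, dim)
    # L2-normalize so that dot products equal cosine similarity.
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Toy batch: 2 sequences, 4 positions, 3-dimensional hidden states.
rng = np.random.default_rng(0)
h = rng.normal(size=(2, 4, 3))
mask = np.array([[1, 1, 1, 0],   # sequence 1 has 3 real tokens
                 [1, 1, 1, 1]])  # sequence 2 uses all 4 positions
emb = last_token_pool(h, mask)
print(emb.shape)  # → (2, 3)
```

Because an autoregressive model attends only to preceding tokens, the last token's hidden state is the one position that has seen the entire input, which is why it can serve as a summary embedding without any extra pooling head.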

📝 Abstract
jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
Problem

Research questions and friction points this paper is trying to address.

Retrieve code from natural language queries
Perform technical question-answering tasks
Identify similar code snippets across languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses an autoregressive backbone pre-trained on both text and code
Generates embeddings via last-token pooling
Achieves state-of-the-art performance with relatively small models
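Once queries and code snippets are embedded this way, retrieval reduces to nearest-neighbor search in the embedding space. A minimal sketch, assuming pre-computed unit-norm embeddings (the vectors below are hypothetical stand-ins for real model output):

```python
import numpy as np

# Hypothetical unit-norm embeddings for three code snippets and one
# natural-language query (stand-ins for real model output).
code_embs = np.array([
    [1.0, 0.0, 0.0],   # snippet 0
    [0.0, 1.0, 0.0],   # snippet 1
    [0.6, 0.8, 0.0],   # snippet 2
])
query_emb = np.array([0.8, 0.6, 0.0])

def retrieve(query: np.ndarray, corpus: np.ndarray, top_k: int = 2):
    """Rank corpus rows by cosine similarity (dot product of unit vectors)."""
    scores = corpus @ query
    order = np.argsort(-scores)[:top_k]
    return order.tolist(), scores[order].tolist()

idx, scores = retrieve(query_emb, code_embs)
print(idx)  # → [2, 0]
```

Because both queries and code live in one embedding space, the same index serves natural-language-to-code retrieval, technical QA lookup, and cross-language snippet matching.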
Daria Kryvosheieva
Massachusetts Institute of Technology
Saba Sturua
Jina AI GmbH
Michael Günther
Jina AI GmbH
Scott Martens
Jina AI GmbH
Han Xiao
Jina AI GmbH