🤖 AI Summary
This work addresses three core tasks in code intelligence: code retrieval, technical question answering, and cross-lingual semantic matching. We propose a lightweight, efficient suite of small code embedding models. Methodologically, we build on an autoregressive language model pre-trained jointly on text and code, departing from conventional dual-encoder paradigms; code representations are derived solely via last-token pooling over the generative output sequence, then refined through supervised fine-tuning to shape the embedding space. This design keeps computational overhead low at a modest parameter count. Empirically, the models achieve state-of-the-art performance on CodeSearchNet benchmarks across all three tasks—code retrieval, cross-lingual similarity identification, and technical QA—demonstrating both the effectiveness and scalability of generative architectures for code representation learning.
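The last-token pooling step can be sketched as follows. This is a minimal illustration on toy tensors, not the authors' implementation; the `last_token_pool` helper, array shapes, and right-padding assumption are all ours:

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Use the hidden state of the final non-padding token as the embedding.

    hidden_states:  (batch, seq_len, dim) decoder outputs
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    # Index of the last real token in each (right-padded) sequence.
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]

# Toy example: batch of 2, sequence length 4, embedding dim 3.
hidden = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # 3 real tokens -> pool position 2
                 [1, 1, 1, 1]])  # 4 real tokens -> pool position 3
emb = last_token_pool(hidden, mask)  # shape (2, 3)
```

Because the backbone is autoregressive, the final token's hidden state has attended to the entire input, which is what makes it a reasonable single-vector summary.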
📝 Abstract
jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
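Once embeddings are pooled, retrieving code for a natural-language query reduces to ranking candidates by cosine similarity in the embedding space. A minimal sketch; the vectors below are illustrative placeholders, not real model outputs:

```python
import numpy as np

def cosine_rank(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Return candidate indices sorted by descending cosine similarity."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ q  # cosine similarity of each candidate to the query
    return np.argsort(-scores)

# Toy 2-D embeddings: candidate 1 points nearly the same way as the query.
query = np.array([1.0, 0.0])
cands = np.array([[0.0, 1.0],    # orthogonal to the query
                  [2.0, 0.1],    # nearly parallel
                  [-1.0, 0.0]])  # opposite direction
ranking = cosine_rank(query, cands)
```

The same scoring works for all three tasks: text-to-code retrieval, QA over technical documents, and matching semantically similar snippets across programming languages.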