🤖 AI Summary
This work addresses three core tasks in code intelligence: code retrieval, technical question answering, and cross-lingual semantic matching. We propose a lightweight, efficient suite of small code embedding models. Methodologically, we build on an autoregressive language model pre-trained jointly on text and code, departing from conventional dual-encoder paradigms; code representations are derived solely via last-token pooling over the generative output sequence, then refined through supervised fine-tuning to shape the embedding space. This design keeps computational overhead low at a modest parameter count. Empirically, the models achieve state-of-the-art performance on CodeSearchNet benchmarks across all three tasks—code retrieval, cross-lingual similarity identification, and technical QA—demonstrating both the effectiveness and scalability of generative architectures for code representation learning.
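The last-token pooling step can be sketched as follows. This is a minimal illustration on toy tensors, not the authors' implementation; the `last_token_pool` helper, array shapes, and right-padding assumption are all ours:

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Use the hidden state of the final non-padding token as the embedding.

    hidden_states:  (batch, seq_len, dim) decoder outputs
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    # Index of the last real token in each (right-padded) sequence.
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]

# Toy example: batch of 2, sequence length 4, embedding dim 3.
hidden = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # 3 real tokens -> pool position 2
                 [1, 1, 1, 1]])  # 4 real tokens -> pool position 3
emb = last_token_pool(hidden, mask)  # shape (2, 3)
```

Because the backbone is autoregressive, the final token's hidden state has attended to the entire input, which is what makes it a reasonable single-vector summary.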
📝 Abstract
jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
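Once embeddings are pooled, retrieving code for a natural-language query reduces to ranking candidates by cosine similarity in the embedding space. A minimal sketch; the vectors below are illustrative placeholders, not real model outputs:

```python
import numpy as np

def cosine_rank(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Return candidate indices sorted by descending cosine similarity."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ q  # cosine similarity of each candidate to the query
    return np.argsort(-scores)

# Toy 2-D embeddings: candidate 1 points nearly the same way as the query.
query = np.array([1.0, 0.0])
cands = np.array([[0.0, 1.0],    # orthogonal to the query
                  [2.0, 0.1],    # nearly parallel
                  [-1.0, 0.0]])  # opposite direction
ranking = cosine_rank(query, cands)
```

The same scoring works for all three tasks: text-to-code retrieval, QA over technical documents, and matching semantically similar snippets across programming languages.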