C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

📅 2025-12-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the sequence-information bottleneck induced by EOS-token embeddings in code retrieval, this paper proposes C2LLM, a family of contrastive code large language models built upon Qwen-2.5-Coder. The core innovation is an Adaptive Cross-Attention Pooling (Pooling by Multihead Attention, PMA) module: the first to enable full-sequence information aggregation under causal-modeling constraints while also supporting flexible embedding-dimension adaptation, replacing conventional mean/max pooling. Trained via contrastive learning on a three-million-sample code corpus, C2LLM achieves state-of-the-art performance among same-scale models on the MTEB-Code benchmark: C2LLM-7B ranks first among 7B-parameter models, demonstrating PMA's effectiveness in enhancing the semantic density and discriminability of code representations.

๐Ÿ“ Abstract
We present C2LLM (Contrastive Code Large Language Models), a family of code embedding models in 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating a sequence embedding from token embeddings, effectively 1) utilizing the LLM's causal representations acquired during pretraining, 2) aggregating information from all tokens in the sequence, breaking the information bottleneck of EOS-based sequence embeddings, and 3) supporting flexible adaptation of the embedding dimension, serving as an alternative to MRL. Trained on three million publicly available samples, C2LLM models set new records on MTEB-Code among models of similar size, with C2LLM-7B ranking 1st on the overall leaderboard.
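The pooling scheme described in the abstract can be sketched as a learnable query cross-attending over all token states, so every token (not just the EOS position) contributes to the sequence embedding, and the query's width sets the output dimension. This is a minimal illustrative sketch, not the paper's implementation; module and parameter names (`PMAPooling`, `hidden_dim`, `embed_dim`) are assumptions.

```python
import torch
import torch.nn as nn

class PMAPooling(nn.Module):
    """Sketch of Pooling by Multihead Attention (PMA) for sequence embeddings."""

    def __init__(self, hidden_dim: int, embed_dim: int, num_heads: int = 8):
        super().__init__()
        # A single learnable query vector; its size sets the embedding
        # dimension, giving flexible-dimension outputs without MRL.
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.attn = nn.MultiheadAttention(
            embed_dim, num_heads, kdim=hidden_dim, vdim=hidden_dim,
            batch_first=True)

    def forward(self, token_states, padding_mask=None):
        # token_states: (batch, seq_len, hidden_dim) from the causal LLM.
        # Cross-attention lets the query aggregate information from all
        # tokens, avoiding the EOS-embedding bottleneck.
        q = self.query.expand(token_states.size(0), -1, -1)
        pooled, _ = self.attn(q, token_states, token_states,
                              key_padding_mask=padding_mask)
        return pooled.squeeze(1)  # (batch, embed_dim)

# Usage: pool 16-token hidden states into 512-dim embeddings.
pool = PMAPooling(hidden_dim=896, embed_dim=512)
emb = pool(torch.randn(2, 16, 896))  # shape (2, 512)
```

Because the backbone stays causal, PMA only changes how token states are aggregated, preserving the representations acquired during pretraining.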
Problem

Research questions and friction points this paper is trying to address.

Improves code retrieval by generating better sequence embeddings from token representations
Overcomes limitations of EOS-based embeddings through adaptive cross-attention pooling
Enables flexible embedding dimensions as an alternative to MRL methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive cross-attention pooling for sequence embeddings
Utilizes pretrained LLM causal representations effectively
Supports flexible embedding dimension adaptation
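The training signal behind these contributions is contrastive: the report says C2LLM is trained contrastively on three million samples. A common choice for this setup is an in-batch-negative InfoNCE loss over query/code pairs; the sketch below assumes that objective and an illustrative temperature, since the report excerpt does not specify the exact loss.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, code_emb, temperature=0.05):
    """In-batch-negative InfoNCE over paired (query, code) embeddings.

    query_emb, code_emb: (batch, dim) tensors; row i of each is a
    positive pair, and all other rows serve as negatives.
    """
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.T / temperature                  # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)

# Usage with random stand-in embeddings.
loss = info_nce(torch.randn(4, 512), torch.randn(4, 512))
```

This objective pulls each query toward its paired code snippet and pushes it from the rest of the batch, which is what drives the semantic density and discriminability claimed for the PMA embeddings.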