Identifier-Free Code Embedding Models for Scalable Search

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches struggle to match functions efficiently and accurately in both directions between source code and decompiled code when standard preprocessing strips identifiers. This work fine-tunes a Qwen3-Embedding model with contrastive learning to build a semantically aligned embedding space that does not rely on symbolic information, enabling bidirectional cross-representation matching in fully de-identified settings. The fine-tuned model outperforms the other evaluated models on all function association benchmarks by a substantial margin. It also generalizes to a constant-algorithm association task that it was not explicitly trained on.
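To make the contrastive objective concrete, here is a minimal sketch of a symmetric InfoNCE loss over paired source and decompiled embeddings. It is a generic formulation, not the paper's confirmed loss: the function names, the temperature value, and the use of in-batch negatives are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.05):
    """Mean cross-entropy of selecting positives[i] for anchors[i].

    Every other entry of `positives` acts as an in-batch negative.
    The temperature is an assumed value, not taken from the paper.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)

def symmetric_info_nce(src, dec, temperature=0.05):
    """Average the source-to-decompiled and decompiled-to-source losses."""
    return 0.5 * (info_nce(src, dec, temperature) + info_nce(dec, src, temperature))

# Toy 2-D embeddings: row i of each list describes the same function.
src = [[1.0, 0.0], [0.0, 1.0]]
dec_aligned = [[0.9, 0.1], [0.1, 0.9]]
dec_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_info_nce(src, dec_aligned)
loss_swapped = symmetric_info_nce(src, dec_swapped)
```

Averaging both directions is one plausible way to encourage an embedding space that serves source-to-decompiled and decompiled-to-source queries equally; on the toy data, the aligned pairing yields a lower loss than the swapped one.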

📝 Abstract

Function association is a useful process for binary reverse engineers. Search tools exist to perform association at scale, but they do not utilize the full range of capabilities that AI-enabled search provides. Prior work has explored the development of embedding models for association between certain reverse engineering code representations, but that work does not cover bidirectional association between source code and decompiled, stripped code with standard preprocessing requirements. To bridge this gap, we formalize this function association problem and evaluate the extent to which embedding models can bidirectionally associate between these two representations. To improve model performance at this task, we fine-tune a Qwen3-Embedding model with contrastive learning. We find that our new model outperforms other models on all function association baselines by a substantial margin and generalizes to a constant-algorithm association task it is not explicitly trained on.
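As an illustration of what evaluating bidirectional association can look like, the sketch below runs nearest-neighbor search over embeddings in both directions and reports top-1 accuracy. The toy vectors stand in for model outputs, and the metric choice (recall at 1) and all function names are assumptions, since the abstract does not specify the evaluation protocol.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top1(query, corpus):
    """Index of the corpus vector most similar to the query."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, normalize(c))) for c in corpus]
    return max(range(len(corpus)), key=scores.__getitem__)

def recall_at_1(queries, corpus):
    """Fraction of queries whose nearest neighbor is the same-index entry."""
    hits = sum(1 for i, q in enumerate(queries) if top1(q, corpus) == i)
    return hits / len(queries)

# Hypothetical embeddings: row i in both lists is the same function,
# once embedded from source and once from stripped, decompiled code.
source_emb = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
decompiled_emb = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.2, 0.1, 0.9]]

src_to_dec = recall_at_1(source_emb, decompiled_emb)  # query with source
dec_to_src = recall_at_1(decompiled_emb, source_emb)  # query with decompiled
```

Reporting the two directions separately matters because a model can retrieve well in one direction and poorly in the other. At scale, the linear scan in `top1` would typically be replaced by an approximate nearest-neighbor index.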

Problem

Research questions and friction points this paper is trying to address.

function association

code embedding

binary reverse engineering

decompiled code

identifier-free

Innovation

Methods, ideas, or system contributions that make the work stand out.

identifier-free

code embedding

function association