🤖 AI Summary
Existing approaches do not cover bidirectional function matching between source code and decompiled, stripped code under standard preprocessing requirements. This work formalizes that matching problem and fine-tunes a Qwen3-Embedding model with contrastive learning, so that source and decompiled functions are aligned in one embedding space without relying on symbol information. The fine-tuned model outperforms the other evaluated models on all function-matching baselines by a substantial margin. It also generalizes to a constant-algorithm matching task that it was not explicitly trained on.
📝 Abstract
Function association is a useful process for binary reverse engineers. Search tools exist to perform association at scale, but they do not use the full range of capabilities that AI-enabled search provides. Prior work has developed embedding models for association between certain reverse engineering code representations, but it does not cover bidirectional association between source code and decompiled, stripped code under standard preprocessing requirements. To bridge this gap, we formalize this function association problem and evaluate the extent to which embedding models can associate bidirectionally between these two representations. To improve model performance on this task, we fine-tune a Qwen3-Embedding model with contrastive learning. We find that our new model outperforms other models on all function association baselines by a substantial margin and generalizes to a constant-algorithm association task that it was not explicitly trained on.
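The abstract does not state the loss, temperature, or batch construction used for fine-tuning. As a rough illustration of what contrastive learning over paired functions can look like, the sketch below assumes a symmetric InfoNCE objective with in-batch negatives, a common choice for embedding models but not confirmed as the authors' method. The function names, the temperature value, and the three-dimensional toy vectors are all hypothetical; real embeddings would come from the model.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(src, dec):
    """Cosine similarity of every source embedding against every decompiled one."""
    src_n = [normalize(v) for v in src]
    dec_n = [normalize(v) for v in dec]
    return [[sum(a * b for a, b in zip(s, d)) for d in dec_n] for s in src_n]


def cross_entropy_diagonal(sim, temperature):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [x / temperature for x in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)


def symmetric_info_nce(src, dec, temperature=0.05):
    """Assumed objective: average of the source->decompiled and
    decompiled->source losses, so both retrieval directions are trained."""
    sim = similarity_matrix(src, dec)
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (
        cross_entropy_diagonal(sim, temperature)
        + cross_entropy_diagonal(sim_t, temperature)
    )


def top1(sim):
    """Index of the best-scoring candidate for each query row."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


# Hypothetical toy embeddings: row i of `src` and row i of `dec` stand for
# the same function in its source and decompiled form.
src = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
dec = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 1.1]]

sim = similarity_matrix(src, dec)
src_to_dec = top1(sim)                                # source query, decompiled candidates
dec_to_src = top1([list(col) for col in zip(*sim)])   # decompiled query, source candidates
aligned_loss = symmetric_info_nce(src, dec)
misaligned_loss = symmetric_info_nce(src, dec[::-1])  # pairs 0 and 2 swapped
```

In this toy setup the loss is low when matching pairs sit on the diagonal of the similarity matrix and rises when the pairing is scrambled. Averaging the two directions is what makes the objective bidirectional: the same similarity matrix is read row-wise for source-to-decompiled retrieval and column-wise for the reverse.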