🤖 AI Summary
This work addresses a limitation of existing code pretraining: raw code gives models little signal for the diverse task formats they meet in real-world use, which hinders generalization. To overcome this, the authors propose five strategies (CodeEnhance, CodeQA, CodeDev, CodeDialogue, and CodeTrace) that transform large-scale open-source code into semantically rich synthetic data through quality-aware rewriting, template-based question generation, developer-task simulation, multi-turn dialogue construction, and multi-language execution tracing. The resulting dataset spans 15 programming languages and over 5,000 libraries, enabling the first large-scale (100B+) synthesis of code and execution traces. Alongside the data, the authors introduce two new evaluation benchmarks, DevEval and TraceEval, which expose significant gaps in current models' task comprehension and trace prediction capabilities. A trained 3B-parameter model reaches 83.5% on HumanEval, 63.2% on MBPP, an 8.09% win rate on DevEval, and 15.36 ROUGE-2 on TraceEval, outperforming state-of-the-art models ten times its size.
📝 Abstract
Pre-training on raw code teaches syntax but provides sparse signal for diverse real-world task formats. While synthetic data has proven transformative for language models, its use for code remains largely unexplored beyond limited quality improvements. We present CodeAlchemy, a synthetic data generation framework that transforms publicly sourced code into semantically rich training data through five strategies: CodeEnhance (quality-aware rewriting), CodeQA (template-based problems), CodeDev (developer tasks), CodeDialogue (multi-turn conversations), and CodeTrace (execution traces). We process 3 corpora across 15 languages to generate 500B+ tokens of synthetic data plus 350B reasoning tokens, orders of magnitude more than prior efforts. CodeTrace instruments and executes 1.3M+ files across 14 languages and 5K libraries, capturing control flow, state tracking, and library knowledge. We introduce two benchmarks, DevEval (developer tasks) and TraceEval (execution prediction); frontier models like Claude Sonnet 4.5 achieve only 5.6% exact match on TraceEval, revealing critical gaps in semantic understanding. Our 3B models achieve 83.5% on HumanEval, 63.2% on MBPP, an 8.09% win rate on DevEval, and 15.36 ROUGE-2 on TraceEval, outperforming frontier models 10x their size, including 27B Gemma-3 and 32B Granite-4.0.
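
To make the idea of an execution trace concrete, the following is a minimal sketch of how control flow and variable state could be recorded for a single Python function. It is an illustration only: the abstract does not describe how CodeTrace instruments code, so the use of Python's standard `sys.settrace` hook, the helper name `trace_execution`, and the example function `sum_of_evens` are all assumptions made here, not the authors' method.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args), recording (line offset, locals snapshot) before each executed line.

    Hypothetical helper for illustration; not part of CodeAlchemy.
    """
    events = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Ignore frames that do not belong to the function under trace.
        if frame.f_code is not code:
            return None
        if event == "line":
            # Offset relative to the `def` line, so the trace is position independent.
            offset = frame.f_lineno - code.co_firstlineno
            # Copy the locals: the snapshot reflects state *before* the line runs.
            events.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        # Always restore whatever trace hook was active before.
        sys.settrace(previous)
    return result, events


def sum_of_evens(n):
    total = 0
    for i in range(n):
        if i % 2 == 0:
            total += i
    return total


result, events = trace_execution(sum_of_evens, 5)
for offset, local_vars in events:
    print(offset, local_vars)
```

Each recorded event pairs the line about to run with the variable values at that point, so the sequence of offsets shows which branches and loop iterations were taken (control flow) and the snapshots show how `total` and `i` evolve (state tracking). An execution-prediction task in the spirit of TraceEval would ask a model to produce such a sequence without running the code.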