AI Summary
Existing post-link optimizers are limited to intra-procedural code layout. Inter-procedural layout opportunities remain hard to exploit because the search space explodes combinatorially and call/return semantics are difficult to model. This work proposes the Magellan agent-based workflow, which evolves Propeller's heuristic into a fine-grained inter-procedural optimizer. The evolutionary search is guided by real hardware performance counter feedback rather than by traditional static cost models, and the work is the first to apply fine-grained inter-procedural layout to warehouse-scale industrial applications. By integrating the AlphaEvolve agent, evolutionary algorithms, and fine-grained code reordering, the method achieves performance gains of 0.23% to 1.6% even on binaries already optimized with state-of-the-art feedback-directed optimization (FDO) and post-link optimization (PLO).
Abstract
Post-link optimizers (PLOs) such as Propeller and BOLT have demonstrated that precise, profile-guided code layout can extract significant performance gains from heavily optimized binaries. However, these systems are currently restricted to intraprocedural techniques, leaving the global potential of interprocedural layout largely untapped. Interprocedural code layout is historically difficult because of a combinatorially intractable search space and complex call-return semantics that are challenging to model. Consequently, the performance potential of fine-grained interprocedural layout remains unproven in practice. AI-PROPELLER uses Magellan, an agentic workflow that evolves the compiler heuristic in Propeller into a fine-grained interprocedural optimizer and fine-tunes the resulting policy hyperparameters. To ensure high fidelity, we move away from approximate static cost models: the agentic workflow generates multiple layout variants that are executed on actual hardware to measure real performance counters, providing a precise reward signal for the evolutionary loop. AI-PROPELLER has been evaluated on several benchmarks, including large warehouse-scale applications. Experiments show performance improvements of 0.23% to 1.6% on binaries already optimized with state-of-the-art FDO and PLO, which is significant for real-world binaries. This is the first time that large warehouse-scale applications in industrial settings have been optimized with fine-grained interprocedural code layout.
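The measure-and-select loop described above can be illustrated with a toy sketch. This is not AI-PROPELLER or Magellan. The abstract describes evolving the layout heuristic itself and tuning its hyperparameters, whereas this sketch evolves function orderings directly to keep the example short. The call profile, the function names, and `measure_cost` are all hypothetical. In particular, `measure_cost` is a synthetic stand-in for running a layout variant on real hardware and reading performance counters.

```python
import random

# Hypothetical call-frequency profile: (caller, callee) -> call count.
CALL_COUNTS = {
    ("main", "parse"): 90,
    ("parse", "lex"): 80,
    ("lex", "intern"): 60,
    ("main", "report"): 5,
    ("report", "format"): 4,
}
FUNCTIONS = sorted({name for edge in CALL_COUNTS for name in edge})


def measure_cost(layout):
    """Synthetic stand-in for a hardware measurement (lower is better).

    A real workflow would link the binary with this layout, run it, and
    read performance counters. Here the cost is call-weighted distance.
    """
    position = {name: i for i, name in enumerate(layout)}
    return sum(count * abs(position[a] - position[b])
               for (a, b), count in CALL_COUNTS.items())


def mutate(layout, rng):
    """Return a copy of the layout with two positions swapped."""
    child = list(layout)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child


def evolve(generations=50, population_size=8, seed=0):
    """Keep the best half each generation and refill with mutants."""
    rng = random.Random(seed)
    population = [rng.sample(FUNCTIONS, len(FUNCTIONS))
                  for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=measure_cost)
        survivors = population[: population_size // 2]
        mutants = [mutate(rng.choice(survivors), rng)
                   for _ in range(population_size - len(survivors))]
        population = survivors + mutants
    return min(population, key=measure_cost)


best = evolve()
print(best, measure_cost(best))
```

Because the best half of each generation is carried over unchanged, the best measured cost never gets worse from one generation to the next. With real hardware measurements the reward would be noisy, so a practical loop would likely need repeated runs per variant.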