🤖 AI Summary
Execution-based training environments for language-model agents remain limited in scale and generality, hindering progress toward more capable ML agents. To address this, we propose CTF-Dojo, the first large-scale, executable, Docker-based cybersecurity training environment, comprising 658 real-world Capture-The-Flag (CTF) challenges, together with CTF-Forge, an automated build system that generates fully reproducible environments with zero manual intervention. Methodologically, we combine containerized execution, execution-verified feedback, and reinforcement learning to train language models within authentic runtime environments. Using only 486 high-quality execution trajectories, we train a 32B model that reaches 31.9% Pass@1 across three benchmarks, an absolute gain of up to 11.6% over prior methods, rivaling state-of-the-art closed-source models. Our work establishes a new open-weight state of the art for execution-aware agent training.
📝 Abstract
Large language models (LLMs) have demonstrated exceptional capabilities when trained within executable runtime environments, notably excelling at software engineering tasks through verified feedback loops. Yet scalable, generalizable execution-grounded environments remain scarce, limiting progress in training more capable ML agents. We introduce CTF-Dojo, the first large-scale executable runtime tailored for training LLMs with verifiable feedback, featuring 658 fully functional Capture-The-Flag (CTF)-style challenges containerized in Docker with guaranteed reproducibility. To enable rapid scaling without manual intervention, we develop CTF-Forge, an automated pipeline that transforms publicly available artifacts into ready-to-use execution environments in minutes, eliminating the weeks of expert configuration traditionally required. We train LLM-based agents on just 486 high-quality, execution-verified trajectories from CTF-Dojo, achieving up to 11.6% absolute gains over strong baselines across three competitive benchmarks: InterCode-CTF, NYU CTF Bench, and Cybench. Our best-performing 32B model reaches 31.9% Pass@1, establishing a new open-weight state of the art that rivals frontier models like DeepSeek-V3-0324 and Gemini-2.5-Flash. By framing CTF-style tasks as a benchmark for executable-agent learning, CTF-Dojo demonstrates that execution-grounded training signals are not only effective but pivotal for building high-performance ML agents without dependence on costly proprietary systems.