Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of enhancing large language model (LLM)-based agents' automated problem-solving in offensive security tasks, particularly Capture the Flag (CTF) competitions, where current evaluations suffer from coarse granularity, a lack of standardized benchmarks, and uninformed hyperparameter choices. To this end, the paper makes three key contributions: (1) CTFJudge, a fine-grained LLM-as-judge evaluation framework, coupled with a novel partial-correctness metric, the CTF Competency Index (CCI); (2) a systematic study of how LLM hyperparameters (temperature, top-p, maximum token length) affect agent performance and multi-agent coordination; and (3) CTFTiny, a lightweight, reproducible, and standardized benchmark of 50 representative CTF challenges. Experiments show that combining LLM-as-judge evaluation with informed hyperparameter tuning improves solution efficiency and accuracy across five core security task categories: binary exploitation, web, reverse engineering, forensics, and cryptography. All code and benchmark data are publicly released to foster reproducible research on security-oriented intelligent agents.

📝 Abstract
Recent advances in LLM agentic systems have improved the automation of offensive security tasks, particularly for Capture the Flag (CTF) challenges. We systematically investigate the key factors that drive agent success and provide a detailed recipe for building effective LLM-based offensive security agents. First, we present CTFJudge, a framework leveraging LLM as a judge to analyze agent trajectories and provide granular evaluation across CTF solving steps. Second, we propose a novel metric, the CTF Competency Index (CCI), for partial correctness, revealing how closely agent solutions align with human-crafted gold standards. Third, we examine how LLM hyperparameters, namely temperature, top-p, and maximum token length, influence agent performance and automated cybersecurity task planning. For rapid evaluation, we present CTFTiny, a curated benchmark of 50 representative CTF challenges across binary exploitation, web, reverse engineering, forensics, and cryptography. Our findings identify optimal multi-agent coordination settings and lay the groundwork for future LLM agent research in cybersecurity. We make CTFTiny open source to the public at https://github.com/NYU-LLM-CTF/CTFTiny, along with CTFJudge at https://github.com/NYU-LLM-CTF/CTFJudge.
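The hyperparameter study sweeps temperature, top-p, and maximum token length. A minimal sketch of such a sweep is shown below; the grid values and the stubbed `solve_rate` function are illustrative assumptions, not the authors' actual configuration or harness.

```python
from itertools import product

# Hypothetical grid over the three hyperparameters the paper studies.
GRID = {
    "temperature": [0.0, 0.5, 1.0],
    "top_p": [0.9, 1.0],
    "max_tokens": [1024, 4096],
}

def solve_rate(config):
    """Placeholder scorer: a real harness would run the agent on the
    CTFTiny challenges under `config` and return the fraction solved."""
    return 0.0  # stub value; no agent is actually invoked here

def sweep(grid):
    """Evaluate every hyperparameter combination and return the
    best-scoring (config, score) pair."""
    keys = list(grid)
    best = None
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        score = solve_rate(config)
        if best is None or score > best[1]:
            best = (config, score)
    return best

best_config, best_score = sweep(GRID)
```

With all scores tied at the stub value, the sweep simply returns the first of the 12 grid combinations; swapping in a real `solve_rate` turns this into an exhaustive search over the grid.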
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM hyperparameters for offensive security agents
Evaluating agent performance using CTFJudge and CCI metrics
Developing a lightweight CTF benchmark for rapid testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM as judge for agent trajectory analysis
CTF Competency Index for solution alignment
Hyperparameter tuning for cybersecurity performance
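The CTF Competency Index rewards partial progress by comparing an agent's trajectory against a human-crafted gold-standard solution. As a rough illustration only (the paper's exact CCI formula is not reproduced here), partial correctness can be modeled as the fraction of gold-standard solution steps that an LLM judge marks as completed:

```python
def cci(judged_steps):
    """Toy partial-correctness score: the fraction of gold-standard
    solution steps a judge marked as completed by the agent.
    `judged_steps` is a list of booleans, one per gold step.
    Illustrative stand-in, not the paper's exact CCI definition."""
    if not judged_steps:
        return 0.0
    return sum(judged_steps) / len(judged_steps)

# Example: the agent completed 3 of 4 steps of a hypothetical
# reverse-engineering task (identify binary, decompile, locate the
# flag check, extract the flag).
score = cci([True, True, True, False])
```

A score of 1.0 corresponds to a full solve, while intermediate values capture the "how close did it get" signal that a binary solved/unsolved metric discards.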