pith. machine review for the scientific record.

arxiv: 2508.05674 · v2 · submitted 2025-08-05 · 💻 cs.CR · cs.AI

Recognition: unknown

Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark

Authors on Pith: no claims yet
classification 💻 cs.CR cs.AI
keywords: agent, CTFJudge, CTFTiny, offensive security, across agents, benchmark
Original abstract

Recent advances in LLM agentic systems have improved the automation of offensive security tasks, particularly for Capture the Flag (CTF) challenges. We systematically investigate the key factors that drive agent success and provide a detailed recipe for building effective LLM-based offensive security agents. First, we present CTFJudge, a framework leveraging an LLM as a judge to analyze agent trajectories and provide granular evaluation across CTF solving steps. Second, we propose a novel metric, the CTF Competency Index (CCI), for partial correctness, revealing how closely agent solutions align with human-crafted gold standards. Third, we examine how LLM hyperparameters, namely temperature, top-p, and maximum token length, influence agent performance and automated cybersecurity task planning. For rapid evaluation, we present CTFTiny, a curated benchmark of 50 representative CTF challenges across binary exploitation, web, reverse engineering, forensics, and cryptography. Our findings identify optimal multi-agent coordination settings and lay the groundwork for future LLM agent research in cybersecurity. We make CTFTiny open source to the public at https://github.com/NYU-LLM-CTF/CTFTiny, along with CTFJudge at https://github.com/NYU-LLM-CTF/CTFJudge.
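To make the partial-correctness idea concrete, below is a minimal sketch of how an LLM-as-judge evaluation could feed a CCI-style score. The paper's abstract does not give the CCI formula or the judged step names, so everything here is an assumption: the step labels, the simple mean aggregation, the StepVerdict structure, and the hyperparameter values shown for the agent's generation config are illustrative only, not the authors' implementation.

```python
# Illustrative sketch only (not the paper's CTFJudge/CCI implementation).
# Assumption: CCI is treated here as the mean of per-step judge scores
# against a human-crafted gold-standard trajectory.
from dataclasses import dataclass

# Hyperparameters the paper studies for agent LLM calls; values are examples.
GENERATION_CONFIG = {
    "temperature": 0.5,   # sampling randomness
    "top_p": 0.95,        # nucleus sampling cutoff
    "max_tokens": 2048,   # maximum completion length
}

@dataclass
class StepVerdict:
    step: str     # e.g. "recon", "vulnerability identification", "exploitation"
    score: float  # judge-assigned alignment with the gold-standard step, in [0, 1]

def ctf_competency_index(verdicts: list[StepVerdict]) -> float:
    """Toy partial-correctness index: mean of per-step judge scores.

    A real pipeline would obtain each score by prompting a judge LLM with the
    agent trajectory and the gold-standard writeup for that solving step.
    """
    if not verdicts:
        return 0.0
    return sum(v.score for v in verdicts) / len(verdicts)

if __name__ == "__main__":
    trajectory_scores = [
        StepVerdict("recon", 1.0),
        StepVerdict("vulnerability identification", 0.7),
        StepVerdict("exploitation", 0.4),
        StepVerdict("flag retrieval", 0.0),
    ]
    # Prints the toy CCI for these hypothetical scores.
    print(f"CCI = {ctf_competency_index(trajectory_scores):.2f}")
```

The point of such a score is that an agent that performs correct reconnaissance and identifies the vulnerability but fails to retrieve the flag still receives nonzero credit, unlike a binary solved/unsolved metric.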

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Autonomous Adversary: Red-Teaming in the age of LLM

    cs.CR 2026-05 unverdicted novelty 5.0

    Expert-defined action plans for LLM agents achieve higher task completion in lateral-movement scenarios than fully autonomous or self-scaffolded modes, but failures remain common due to brittle commands and state handling.

  2. RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs

    cs.CR 2026-04 unverdicted novelty 4.0

    RAVEN combines LLM agents and RAG to generate Project Zero-style vulnerability reports, achieving 54.21% average quality on 105 NIST-SARD samples across 15 CWE types.