Pith · machine review for the scientific record

arxiv: 2605.11504 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.CR

CTFusion: A CTF-based Benchmark for LLM Agent Evaluation

Dongjun Lee, Ga-eun Bae, Insu Yun

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:33 UTC · model grok-4.3

classification: 💻 cs.LG · cs.CR
keywords: CTF · LLM agents · benchmark · data contamination · cybersecurity · evaluation framework · live CTF

The pith

Reused CTF challenges allow data contamination that inflates LLM agent scores, which CTFusion fixes by streaming evaluations from live events.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that existing CTF benchmarks reuse old challenges, enabling agents to cheat via web search or memorized solutions and producing unreliable results for cybersecurity tasks. Experiments with an agent equipped with search tools confirm that contamination occurs in practice on static benchmarks. CTFusion counters this by running on live CTF events, keeping each agent's session independent even under one team account, and forwarding only the first correct flag per challenge to limit disruption to the competition. The system is built as an MCP server for the common CTFd platform so it works with many events and agent designs. Tests across three LLMs, two agents, and five live CTFs indicate that this live approach yields more trustworthy assessments than reused datasets.
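
As a concrete reading of the forwarding rule, the sketch below keeps an independent solved-set per agent while letting the live competition see at most one correct flag per challenge. It is a minimal illustration of the stated design, not the paper's released code; the names (FlagRouter, _submit_to_live_ctf) are hypothetical, and the live-submission call is stubbed out.

```python
# Minimal sketch (hypothetical names) of first-flag-only forwarding with
# per-agent independence under one shared team account.
from dataclasses import dataclass, field


@dataclass
class FlagRouter:
    forwarded: dict[str, str] = field(default_factory=dict)    # challenge_id -> confirmed flag
    solved: dict[str, set[str]] = field(default_factory=dict)  # agent_id -> solved challenge_ids

    def submit(self, agent_id: str, challenge_id: str, flag: str) -> bool:
        """Handle one agent's flag submission for one challenge."""
        if challenge_id in self.forwarded:
            # The live CTF already accepted a flag for this challenge:
            # verify locally so the competition sees no further submissions.
            correct = flag == self.forwarded[challenge_id]
        else:
            # First attempt on this challenge goes to the live platform
            # through the shared team account.
            correct = self._submit_to_live_ctf(challenge_id, flag)
            if correct:
                self.forwarded[challenge_id] = flag
        if correct:
            # Each agent keeps its own solved-set, so a later agent still gets
            # credit even though its flag is never re-forwarded.
            self.solved.setdefault(agent_id, set()).add(challenge_id)
        return correct

    def _submit_to_live_ctf(self, challenge_id: str, flag: str) -> bool:
        # Stand-in for the real CTFd submission call; the platform's verdict
        # would be returned here in the actual framework.
        print(f"live submission for {challenge_id}")
        return True  # pretend the platform accepted the flag


router = FlagRouter()
router.submit("agent-a", "pwn-101", "flag{demo}")  # forwarded to the live CTF
router.submit("agent-b", "pwn-101", "flag{demo}")  # checked locally, not re-forwarded
```

Under this scheme each agent's score is computed from its own solved-set, while the shared team account submits each flag to the competition at most once.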

Core claim

CTFusion is a streaming evaluation framework built on live CTFs that preserves per-agent independence under a single team account and reduces competition impact by forwarding only the first correct flag per challenge, implemented as an MCP server on CTFd to support diverse events and agents.

What carries the argument

The CTFusion streaming framework on CTFd, which enforces per-agent independence and first-flag-only forwarding to prevent contamination and competition effects during live evaluations.
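
To make "an MCP server for the common CTFd platform" concrete, here is a sketch of what such a tool surface could look like, assuming the FastMCP helper from the MCP Python SDK and CTFd's REST endpoints /api/v1/challenges and /api/v1/challenges/attempt. The endpoint names, response fields, and environment variables are assumptions, and the sketch omits the first-flag-only routing shown earlier; it is not the paper's released server.

```python
# Sketch of an MCP tool surface over the CTFd REST API (assumed endpoints).
import os

import requests
from mcp.server.fastmcp import FastMCP

CTFD_URL = os.environ.get("CTFD_URL", "https://ctf.example.org")
HEADERS = {
    "Authorization": f"Token {os.environ.get('CTFD_TOKEN', '')}",
    "Content-Type": "application/json",
}

mcp = FastMCP("ctfd-eval")


@mcp.tool()
def list_challenges() -> list[dict]:
    """Return the live event's challenge list from CTFd."""
    resp = requests.get(f"{CTFD_URL}/api/v1/challenges", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]


@mcp.tool()
def submit_flag(challenge_id: int, flag: str) -> bool:
    """Submit a flag through the shared team account and return the verdict."""
    resp = requests.post(
        f"{CTFD_URL}/api/v1/challenges/attempt",
        headers=HEADERS,
        json={"challenge_id": challenge_id, "submission": flag},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["status"] == "correct"


if __name__ == "__main__":
    mcp.run()  # serves the tools to any MCP-compatible agent
```

Because the tools sit behind the MCP boundary, any agent design that speaks MCP can be evaluated against any CTFd-hosted event without per-agent platform accounts.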

Load-bearing premise

Live CTF events stay uncontaminated, and the combination of the independence rule and first-flag forwarding fully blocks data leakage and competition distortion.

What would settle it

An agent using web search succeeding on a CTFusion challenge whose live event had no prior public solutions or leaks.

Figures

Figures reproduced from arXiv: 2605.11504 by Dongjun Lee, Ga-eun Bae, Insu Yun.

Figure 1. Monthly distribution of CTF competitions (2025).
Figure 2. Success rates of D-CIPHER-WEB and D-CIPHER on NYU CTF Bench.
Figure 3. Evidence of direct flag retrieval by D-CIPHER-WEB for the “1nsayne” challenge.
Figure 4. Overview of the CTFusion framework architecture.
Figure 5. Performance comparison: Live CTFs vs NYU CTF Bench.
Figure 6. Default prompt for D-CIPHER-WEB.
Figure 7. Specialized prompt for pwn challenges in D-CIPHER-WEB.
Figure 8. Success rates across five Live CTFs and NYU CTF Bench.
Figure 10. Problem-solving rates of all model-agent combinations on UIUCTF.
Figure 11. Problem-solving rates of all model-agent combinations on WWCTF.
Figure 13. Problem-solving rates of all model-agent combinations on SCRIPTCTF.
Figure 14. Problem-solving rates for all model-agent pairs on 2023-Quals.
Figure 16. Problem-solving rates for all model-agent pairs on 2022-Quals.
Figure 18. Problem-solving rates for all model-agent pairs on 2021-Quals.
Figure 20. Problem-solving rates for all model-agent pairs on 2020-Quals.
Figure 22. Problem-solving rates for all model-agent pairs on 2019-Quals.
Figure 24. Problem-solving rates for all model-agent pairs on 2018-Quals.
Figure 26. Problem-solving rates for all model-agent pairs on 2017-Quals.
Original abstract

Recent advances in Large Language Models (LLMs) have enabled agentic systems for complex, multi-step tasks; cybersecurity is emerging as a prominent application. To evaluate such agents, researchers widely adopt Capture The Flag (CTF) benchmarks. However, current CTF benchmarks reuse existing challenges, which exposes them to data contamination and potential cheating. Notably, we confirmed these issues in practice by integrating web search tools into an existing agent. To address these limitations, we present CTFusion, a streaming evaluation framework built on Live CTFs. To achieve this, CTFusion preserves per-agent independence under a single team account and reduces competition impact by forwarding only the first correct flag per challenge. Moreover, we implement CTFusion as a Model Context Protocol (MCP) server on the widely used CTFd platform, which offers broad applicability to diverse CTF events and agent types. Through experiments with three LLMs, two agents, and five Live CTFs, we demonstrate that existing CTF benchmarks can be unreliable in assessing LLM-based agents, while CTFusion can serve as a robust solution for evaluating cybersecurity agents. We release CTFusion as open source to foster future research in this area.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that existing CTF benchmarks for LLM agents are unreliable due to data contamination and cheating (demonstrated via web-search tool integration experiments), and introduces CTFusion as a streaming framework for live CTFs. CTFusion achieves per-agent independence under a single team account and reduces competition impact by forwarding only the first correct flag per challenge; it is implemented as an MCP server on the CTFd platform. Experiments with three LLMs, two agents, and five live CTFs are used to support that CTFusion provides a robust alternative, with open-source release.

Significance. If the isolation and forwarding mitigations hold, this work addresses a critical and timely limitation in agent evaluation for cybersecurity, where reliable benchmarks are scarce. The open-source implementation on a widely used platform (CTFd) and the multi-LLM/multi-agent experimental setup are concrete strengths that could enable reproducible follow-up work and broader adoption in LLM agent research.

major comments (3)
  1. [CTFusion framework description (and abstract)] The central claim that CTFusion's two mitigations (per-agent independence under a shared team account + first-flag forwarding) fully neutralize both contamination and competition-impact problems requires stronger justification. The framework description does not specify how shared-account rate limits, platform logging, or sequential flag-submission order are prevented from creating observable differences between agents; without this, the 'robust solution' claim for live events rests on an untested isolation assumption.
  2. [Experiments (and abstract)] The experimental support for the unreliability of existing CTF benchmarks (via web-search integration) is only partially verifiable. The abstract and setup report results with three LLMs and two agents but omit full methods details, specific metrics, quantitative outcomes, or error analysis, weakening the load-bearing claim that current benchmarks are unreliable.
  3. [Live CTF setup and experiments] The assumption that live CTF events remain uncontaminated (and that the five selected events are representative) is not tested or discussed. Potential selection effects, prior exposure, or organizer-side leakage could still affect results, and no evidence is provided that the chosen live events avoid the contamination issues shown for static benchmarks.
minor comments (2)
  1. [Experiments] The abstract states experiments used 'five Live CTFs' but provides no list, table, or description of the specific events or challenges; adding this in the experimental section would improve reproducibility.
  2. [CTFusion implementation] Notation for agent independence and flag-forwarding logic could be clarified with a small diagram or pseudocode, as the current prose description leaves some implementation details ambiguous.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below, indicating where we agree revisions are needed to strengthen the presentation and justification.

Point-by-point responses
  1. Referee: [CTFusion framework description (and abstract)] The central claim that CTFusion's two mitigations (per-agent independence under a shared team account + first-flag forwarding) fully neutralize both contamination and competition-impact problems requires stronger justification. The framework description does not specify how shared-account rate limits, platform logging, or sequential flag-submission order are prevented from creating observable differences between agents; without this, the 'robust solution' claim for live events rests on an untested isolation assumption.

    Authors: We agree that the framework section would benefit from expanded technical details on the isolation mechanisms. In the revised manuscript we will elaborate on the MCP server implementation, specifying that submissions are queued server-side in a first-come-first-served manner without exposing order or timing to agents, that rate-limit handling occurs at the platform level to equalize impact across agents, and that logging strips any agent-identifying metadata. These design choices are already present in the released code; we will add explicit description and a short justification of why they prevent observable differences, thereby addressing the isolation assumption more directly. A minimal sketch of this queueing and logging behavior appears after this response list. revision: yes

  2. Referee: [Experiments (and abstract)] The experimental support for the unreliability of existing CTF benchmarks (via web-search integration) is only partially verifiable. The abstract and setup report results with three LLMs and two agents but omit full methods details, specific metrics, quantitative outcomes, or error analysis, weakening the load-bearing claim that current benchmarks are unreliable.

    Authors: The full manuscript (Sections 3 and 4) already specifies the three LLMs, two agents, and the web-search integration experiment that demonstrates elevated success rates when external search is enabled. To improve verifiability we will expand the experiments section and add an appendix containing the complete method details (prompt templates, tool configurations), all quantitative success rates with and without search, and basic error analysis. The abstract will remain a high-level summary consistent with journal conventions. revision: partial

  3. Referee: [Live CTF setup and experiments] The assumption that live CTF events remain uncontaminated (and that the five selected events are representative) is not tested or discussed. Potential selection effects, prior exposure, or organizer-side leakage could still affect results, and no evidence is provided that the chosen live events avoid the contamination issues shown for static benchmarks.

    Authors: We will add a dedicated discussion subsection on the live-CTF experimental setup. It will describe the selection criteria for the five events (recency, diversity of challenge types, and public availability), explain why the live format inherently lowers the risk of pre-existing data contamination relative to static benchmarks, and acknowledge residual risks such as organizer-side leakage or selection effects. While exhaustive empirical verification of zero contamination is not feasible, the added discussion will make the assumptions and their limitations explicit. revision: yes
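
As a concrete reading of the isolation mechanisms described in response 1, the sketch below shows one way a server-side first-come-first-served gate could serialize live submissions, apply a single pacing rule toward the platform, and keep agent identity out of the shared log. The names (SubmissionGate, submit_to_platform, min_interval) are hypothetical; this is a minimal sketch under the rebuttal's stated assumptions, not the released implementation.

```python
# Hypothetical sketch of the isolation mechanisms: a server-side FCFS queue
# that serializes live submissions and a log with no agent-identifying metadata.
import queue
import threading
import time


class SubmissionGate:
    def __init__(self, submit_to_platform, min_interval: float = 1.0):
        self._submit = submit_to_platform   # callable(challenge_id, flag) -> bool
        self._queue = queue.Queue()         # first come, first served; order stays server-side
        self._min_interval = min_interval   # one shared pacing rule for every agent
        self._log = []
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, agent_id: str, challenge_id: str, flag: str) -> bool:
        """Any agent calls this; it sees only its own verdict, never queue order or timing."""
        done = threading.Event()
        slot = {}
        self._queue.put((challenge_id, flag, slot, done))
        done.wait()
        # agent_id is used only by the caller's own bookkeeping upstream;
        # nothing agent-identifying is written to the shared log.
        self._log.append({"challenge": challenge_id, "ts": round(time.time())})
        return slot["verdict"]

    def _worker(self):
        while True:
            challenge_id, flag, slot, done = self._queue.get()
            slot["verdict"] = self._submit(challenge_id, flag)
            done.set()
            time.sleep(self._min_interval)  # equal pacing toward the platform for all agents
```

Whether this suffices to hide all observable differences between agents (for example, added latency when the queue is busy) is exactly the isolation question the referee raises.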

Circularity Check

0 steps flagged

No circularity in the engineering framework proposal

full rationale

The paper introduces CTFusion as an applied streaming evaluation framework on live CTFs, with per-agent independence and first-flag forwarding as design mitigations. No mathematical derivations, equations, fitted parameters, or self-citations appear in the provided text that reduce any central claim to its own inputs by construction. The confirmation of contamination issues is described as an empirical experiment with web-search tools, and the robustness claim rests on the stated engineering choices rather than any self-referential loop or renamed prior result. This is a self-contained applied contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that reused CTF challenges cause measurable contamination when agents have web access, and that live events plus the forwarding rule remove this without new biases.

axioms (2)
  • Domain assumption: LLM agents equipped with web search can solve or cheat on reused CTF challenges.
    Used to demonstrate unreliability of existing benchmarks.
  • Domain assumption: Live CTF events supply challenges that have not been seen by the evaluated models.
    Core premise for contamination resistance.
invented entities (1)
  • CTFusion streaming framework (no independent evidence)
    Purpose: enables independent per-agent evaluation on shared live CTF accounts.
    New system introduced to solve the identified benchmark problems.

pith-pipeline@v0.9.0 · 5505 in / 1260 out tokens · 55761 ms · 2026-05-13T01:33:21.599669+00:00 · methodology


