arxiv: 2605.07830 · v1 · submitted 2026-05-08 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links

· Lean Theorem

CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

Taein Lim , Seongyong Ju , Munhyeok Kim , Hyunjun Kim , Hoki Kim

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:51 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords LLM agentscybersecurityattack selection biasbenchmarkingbias momentumoffensive securityAI behavior

0 comments

The pith

LLM agents for cyber attacks each concentrate on their own narrow set of attack families even when prompts vary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CyBiasBench to test how five different LLM agents perform when tasked with cyber attacks against three targets. It finds that every agent disproportionately selects from only a few of the ten attack families, and this pattern holds across four different prompt conditions. The concentration appears to be a stable trait of each agent rather than a strategy tied to higher success rates. When the agents are explicitly steered toward families outside their usual pattern, they resist the shift and show no measurable gain in overall attack performance.

Core claim

The paper establishes that autonomous LLM agents in offensive cybersecurity exhibit distinct attack-selection biases, each concentrating its choices on a narrow subset of attack families irrespective of prompt variations. These biases are intrinsic traits of the agents and are uncorrelated with attack success rates. The work further identifies a bias momentum effect in which agents resist explicit redirection toward conflicting attack families, and such forced distribution shifts produce no improvement in attack performance.

What carries the argument

CyBiasBench, a 630-session benchmark that records attack-family allocation distributions and entropy across five agents, three targets, four prompt conditions, and ten attack families.

If this is right

Each agent displays its own dominant attack families and distinct entropy levels in how it allocates choices across families.
Attack-selection bias is better understood as a fixed agent characteristic than as a variable linked to success rate.
Agents actively resist prompt-based attempts to shift them toward attack families outside their established pattern.
Forcing a change in attack-family distribution produces no measurable increase in overall attack performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety testing for LLM agents in security roles should routinely measure persistent choice biases instead of focusing solely on success metrics.
Prompt-based control may have limited effect on agent behavior in adversarial settings because of the observed resistance to distribution shifts.
Applying the same evaluation protocol to additional target environments would help clarify whether the biases persist beyond the current benchmark design.

Load-bearing premise

The observed attack-selection biases and resistance to steering are inherent traits of the LLM agents rather than artifacts of the chosen targets, prompts, or attack family definitions.

What would settle it

Repeating the full set of sessions on a new collection of cyber targets or with redefined attack families and finding that the same agents still favor the identical dominant families while resisting the same steering attempts would support the claim; if preferences shift substantially with the new setup, the inherent-trait interpretation would be falsified.

Figures

Figures reproduced from arXiv: 2605.07830 by Hoki Kim, Hyunjun Kim, Munhyeok Kim, Seongyong Ju, Taein Lim.

**Figure 2.** Figure 2: Overview of CyBiasBench. (Stage 1) Prompt Design. Four prompt conditions are constructed along guidance and output-structure axes. (Stage 2) Agent Penetration Testbed. Each agent under test (AUT) runs in an isolated Kali Linux container and attacks three target applications; the highlighted agent–target pair indicates one example session. (Stage 3) Evaluation Metrics Suite. Logged HTTP traces are classifie… view at source ↗

**Figure 3.** Figure 3: Selection Rate (Seli) heatmap per (agent, family) cell, shown separately for each prompt condition. Within each panel, each cell is averaged over the three target applications and three repetitions for that prompt condition. Higher Seli reflects more frequent selection of that attack family in HTTP requests. “−” indicates that the family was not attempted. and 2.376; Selection CR1 32.1% and 34.5%), whereas… view at source ↗

**Figure 4.** Figure 4: Kruskal–Wallis effect sizes (η 2 ) for across-agent differences: H(X) 0.43, uniquefamily count 0.56, Selection CR1 0.27, session ASR 0.05. Structural metrics separate agents more strongly than session ASR. Frequently selected attack does not guarantee high attack performance [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Cell-level compliance in the bias injection setting against (left) free-choice [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Session ASR under free-choice and bias injection. All agents show mean ∆ASR ≤ 0 under steering. Explicit steering lowers mean session ASR [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Per-family attack-performance transfer. Points are agent–family cells; the y-axis is requested-family ASR under steering. Bias injection does not improve requested-family attack performance [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: The main text additionally reports a 2-axis marginal Prompt-stability JSD that pools the [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 8.** Figure 8: Prompt-condition stability by agent. Lines show normalized condition-specific family [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

**Figure 9.** Figure 9: Confusion matrix (rows normalized to recall) for agent identification from session-level [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: CyBiasBench interactive result dashboard. The dashboard provides a companion interface [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗

read the original abstract

Large language models (LLMs) are increasingly deployed as autonomous agents in offensive cybersecurity. In this paper, we reveal an interesting phenomenon: different agents exhibit distinct attack patterns. Specifically, each agent exhibits an attack-selection bias, disproportionately concentrating its efforts on a narrow subset of attack families regardless of prompt variations. To systematically quantify this behavior, we introduce CyBiasBench, a comprehensive 630-session benchmark that evaluates five agents on three targets and four prompt conditions with ten attack families. We identify explicit bias across agents, with different dominant attack families and varying entropy levels in their attack-family allocation distributions. Such bias is better characterized as a trait of the agents, rather than a factor associated with the attack success rate. Furthermore, our experiments reveal a bias momentum effect, where agents resist explicit steering toward attack families that conflict with their bias. This forced distribution shift does not yield measurable improvements in attack performance. To ensure reproducibility and facilitate future research, we release an interactive result dashboard at https://trustworthyai.co.kr/CyBiasBench/ and a reproducibility artifact with aggregated session-level statistics and full evaluation scripts at https://github.com/Harry24k/CyBiasBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper introduces CyBiasBench to document persistent attack-family preferences in LLM cyber agents and a momentum effect under steering, but the evidence that these are intrinsic traits rather than target or success artifacts is still preliminary.

read the letter

The main thing to know is that the authors built CyBiasBench around 630 sessions with five agents, three targets, four prompt conditions, and ten attack families. They show each agent concentrates on a narrow subset of families even when prompts vary, treat this concentration as an agent trait rather than something tied to success rate, and report that forcing a shift toward other families produces no performance gain and meets resistance. They also release a dashboard and the full evaluation scripts, which is the most immediately useful part of the work.

Referee Report

2 major / 2 minor

Summary. The paper introduces CyBiasBench, a 630-session benchmark evaluating five LLM agents on three targets under four prompt conditions across ten attack families. It claims that each agent exhibits a distinct attack-selection bias, concentrating on a narrow subset of families independent of prompt variations; that this bias is an inherent agent trait uncorrelated with attack success rates; and that agents display a 'bias momentum effect,' resisting explicit steering toward conflicting families without any measurable gain in attack performance.

Significance. If the claims hold after addressing potential confounds, the work would demonstrate persistent, model-specific behavioral traits in LLM-based cyber agents that resist correction and are decoupled from performance optimization. This has direct implications for the reliability of autonomous offensive security tools. The release of an interactive dashboard and full reproducibility artifacts (session statistics and scripts) is a clear strength, enabling community validation and extension.

major comments (2)

[§4] §4 (Results on attack-family distributions): The claim that bias 'is better characterized as a trait of the agents, rather than a factor associated with the attack success rate' requires explicit controls such as per-family success-rate stratification by target or correlation analysis between concentration metrics and success; with only ~10 sessions per agent-target-prompt cell, observed concentrations could arise from target-specific viability rather than agent identity.
[§5] §5 (Bias momentum experiments): The assertion of a 'bias momentum effect' where forced distribution shifts yield no performance improvement must include quantitative before/after success-rate comparisons, confidence intervals, and statistical tests (e.g., paired t-tests or Wilcoxon) to confirm the lack of measurable gains; small per-cell samples make this evidence load-bearing for the steering-resistance claim.

minor comments (2)

[Abstract] Abstract: Provide a brief breakdown of the 630 sessions (e.g., sessions per agent or per cell) to clarify statistical power and design balance.
[Results figures] Figures in results: Add error bars or bootstrap intervals to entropy and distribution plots to convey variability across the limited sessions per condition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough and constructive review. We address each major comment point by point below. Where the comments identify areas for strengthening, we have revised the manuscript with additional analyses while preserving the original claims supported by our data.

read point-by-point responses

Referee: [§4] §4 (Results on attack-family distributions): The claim that bias 'is better characterized as a trait of the agents, rather than a factor associated with the attack success rate' requires explicit controls such as per-family success-rate stratification by target or correlation analysis between concentration metrics and success; with only ~10 sessions per agent-target-prompt cell, observed concentrations could arise from target-specific viability rather than agent identity.

Authors: We appreciate the referee highlighting this potential confound. In the revised manuscript, we have added explicit controls in §4: per-family success rates are now stratified by target for each agent, and we report Pearson correlations between concentration metrics (entropy and dominant-family share) and per-family success rates across sessions. These correlations are weak and non-significant (r < 0.2, p > 0.1), indicating that biases are not driven by success-rate differences. We also include visualizations showing that agent preferences persist across targets despite varying family viability. We acknowledge the ~10 sessions per cell as a limitation and have expanded the discussion in §4 and the limitations section, noting that future work will scale up sessions. These additions reinforce the trait interpretation without altering our core findings. revision: yes
Referee: [§5] §5 (Bias momentum experiments): The assertion of a 'bias momentum effect' where forced distribution shifts yield no performance improvement must include quantitative before/after success-rate comparisons, confidence intervals, and statistical tests (e.g., paired t-tests or Wilcoxon) to confirm the lack of measurable gains; small per-cell samples make this evidence load-bearing for the steering-resistance claim.

Authors: We agree that quantitative statistical support is necessary. The revised §5 now includes before/after success-rate comparisons for original versus steered conditions, with means, 95% confidence intervals, and statistical tests (paired t-tests or Wilcoxon signed-rank tests as appropriate). Results show no significant performance gains (p > 0.05) with negligible effect sizes, presented in a new table and figures broken down by agent and target. We retain the bias-momentum interpretation while adding caveats on sample size and a brief power discussion. This provides the requested rigor without changing the reported outcomes. revision: yes

Circularity Check

0 steps flagged

Observational benchmarking with no circularity in derivation chain

full rationale

The paper introduces CyBiasBench as a new empirical evaluation framework and reports measured attack-family distributions, entropy values, and success rates across 630 sessions. All central claims (agent-specific bias as a trait independent of success rate, and bias momentum under steering) rest on direct counts and comparisons from the experimental runs rather than any equations, fitted parameters, or self-citations that reduce the results to their own inputs by construction. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claims rest on the empirical observations from the new benchmark; no free parameters are described. The work introduces two new descriptive entities without independent falsifiable evidence outside the benchmark itself.

axioms (1)

domain assumption LLM agents can be evaluated in simulated cyber-attack sessions using standardized prompt conditions and attack-family taxonomies
Invoked throughout the benchmark construction and result interpretation in the abstract.

invented entities (2)

attack-selection bias no independent evidence
purpose: To label the observed disproportionate concentration on narrow attack-family subsets as an agent trait
New term coined to characterize the main phenomenon; no independent evidence supplied beyond the benchmark data.
bias momentum effect no independent evidence
purpose: To label the observed resistance to explicit steering toward conflicting attack families
New term coined for the secondary finding; no independent evidence supplied beyond the benchmark data.

pith-pipeline@v0.9.0 · 5517 in / 1554 out tokens · 102261 ms · 2026-05-11T01:51:22.819438+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce CyBiasBench, a comprehensive 630-session benchmark that evaluates five agents on three targets and four prompt conditions with ten attack families... Entropy H(X), Selection Rate Sel_i, Attack Success Rate, Prompt-stability JSD.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Such bias is better characterized as a trait of the agents, rather than a factor associated with the attack success rate.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 5 internal anchors

[1]

Claude-opus-4-5overview

Anthropic. Claude-opus-4-5overview. https://www.anthropic.com/news/ claude-opus-4-5, November 2025. Accessed: 2026-04-16

work page 2025
[2]

CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models.arXiv preprint arXiv:2404.13161, 2024

Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Kapil, et al. Cyberseceval 2: A wide-ranging cybersecurity evaluation suite for large language models.arXiv preprint arXiv:2404.13161, 2024

work page arXiv 2024
[3]

{PentestGPT}: Evaluating and harnessing large language models for automated penetration testing

Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. {PentestGPT}: Evaluating and harnessing large language models for automated penetration testing. In33rd USENIX Security Symposium (USENIX Security 24), pages 847–864, 2024

work page 2024
[4]

Cognitive bias in decision-making with llms

Jessica Maria Echterhoff, Yao Liu, Abeer Alessa, Julian McAuley, and Zexue He. Cognitive bias in decision-making with llms. InFindings of the association for computational linguistics: EMNLP 2024, pages 12640–12653, 2024

work page 2024
[5]

Llm agents can au- tonomously exploit one-day vulnerabilities

Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. Llm agents can autonomously exploit one-day vulnerabilities.arXiv preprint arXiv:2404.08144, 2024

work page arXiv 2024
[6]

Gemini API overview

Google. Gemini API overview. https://ai.google.dev/gemini-api/docs, 2026. Ac- cessed: 2026-04-09

work page 2026
[7]

The Platonic Representation Hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis.arXiv preprint arXiv:2405.07987, 2024

work page Pith review arXiv 2024
[8]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Benchmarking cognitive biases in large language models as evaluators

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking cognitive biases in large language models as evaluators. InFindings of the Association for Computational Linguistics: ACL 2024, pages 517–545, 2024

work page 2024
[10]

Your ai, not your view: The bias of llms in investment analysis

Hoyoung Lee, Junhyuk Seo, Suhwan Park, Junhyeong Lee, Wonbin Ahn, Chanyeol Choi, Alejandro Lopez-Lira, and Yongjae Lee. Your ai, not your view: The bias of llms in investment analysis. InProceedings of the 6th ACM International Conference on AI in Finance, pages 150–158, 2025

work page 2025
[11]

J. Lin. Divergence measures based on the shannon entropy.IEEE Transactions on Information Theory, 37(1):145–151, 1991. doi: 10.1109/18.61115

work page doi:10.1109/18.61115 1991
[12]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents.arXiv preprint arXiv:2308.03688, 2023

work page internal anchor Pith review arXiv 2023
[13]

Agentboard: An analytical evaluation board of multi-turn llm agents.Advances in neural information processing systems, 37:74325–74362, 2024

Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi-turn llm agents.Advances in neural information processing systems, 37:74325–74362, 2024

work page 2024
[14]

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the 2022 conference on empirical methods in natural language processing, pages 11048–11064, 2022

work page 2022
[15]

Common attack pattern enumeration and classification (capec),

MITRE Corporation. Common attack pattern enumeration and classification (capec), . URL <https://capec.mitre.org/>

work page
[16]

Common weakness enumeration (cwe),

MITRE Corporation. Common weakness enumeration (cwe), . URL <https://cwe.mitre. org/>

work page
[17]

Introducing GPT-5.2 codex

OpenAI. Introducing GPT-5.2 codex. https://openai.com/index/ introducing-gpt-5-2-codex/, 2025. Accessed: 2026-04-28. 10

work page 2025
[18]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[19]

OWASP ModSecurity core rule set

OWASP CRS Project. OWASP ModSecurity core rule set. https://coreruleset.org/. Accessed: 2026-04-28

work page 2026
[20]

Owasp web security testing guide

OWASP Foundation. Owasp web security testing guide. URL <https://owasp.org/ www-project-web-security-testing-guide/>

work page
[21]

Owasp top 10 web application security risks, 2025

OWASP Foundation. Owasp top 10 web application security risks, 2025. URL <https: //owasp.org/www-project-top-ten/>

work page 2025
[22]

Quantifying language models’ sensitivity to spurious features in prompt design.arXiv preprint arXiv:2310.11324, 2023

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting.arXiv preprint arXiv:2310.11324, 2023

work page arXiv 2023
[23]

A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

Claude Elwood Shannon. A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

work page 1948
[24]

Task Matters: Knowledge Requirements Shape LLM Responses to Context-Memory Conflict

Kaiser Sun, Fan Bai, and Mark Dredze. Task matters: Knowledge requirements shape llm responses to context-memory conflict, 2026. URLhttps://arxiv.org/abs/2506.06485

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[26]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[27]

We’re different, we’re the same: Creative homogeneity across llms.arXiv preprint arXiv:2501.19361,

Emily Wenger and Yoed Kenett. We’re different, we’re the same: Creative homogeneity across llms.arXiv preprint arXiv:2501.19361, 2025

work page arXiv 2025
[28]

Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts

Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview. net/forum?id=auKAUJZMO6

work page 2024
[29]

and Li, Z., 2024

Jiacen Xu, Jack W Stokes, Geoff McDonald, Xuesong Bai, David Marshall, Siyue Wang, Adith Swaminathan, and Zhou Li. Autoattacker: A large language model guided system to implement automatic cyber-attacks.arXiv preprint arXiv:2403.01038, 2024

work page arXiv 2024
[30]

Swe-agent: Agent-computer interfaces enable automated software engineering

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024

work page 2024
[31]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=WE_vluYUL-X

work page 2023
[32]

Z.AI. GLM-5.1. https://docs.z.ai/guides/llm/glm-5.1, 2026. Accessed: 2026-04-28

work page 2026
[33]

Cybench: A framework for evaluating cybersecurity capabilities and risks of language models

Andy K Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin W Lin, Eliot Jones, Gashon Hussein, Samantha Liu, Donovan Jasper, et al. Cybench: A framework for evaluating cybersecurity capabilities and risks of language models.arXiv preprint arXiv:2408.08926, 2024

work page arXiv 2024
[34]

Zhang, Joey Ji, Celeste Menders, Riya Dulepet, T

Andy K Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, et al. Bountybench: Dollar impact of ai agent attackers and defenders on real-world cybersecurity systems.arXiv preprint arXiv:2505.15216, 2025. 11

work page arXiv 2025
[35]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, and Daniel Kang

Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, et al. Cve-bench: a benchmark for ai agents’ ability to exploit real-world web application vulnerabilities.arXiv preprint arXiv:2503.17332, 2025

work page arXiv 2025
[37]

timestamp

Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, and Daniel Kang. Teams of llm agents can exploit zero-day vulnerabilities. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23–35, 2026. 12 A Free-choice Prompt Templates This section ...

work page 2026