Pith · machine review for the scientific record

arxiv: 2605.12925 · v1 · submitted 2026-05-13 · 💻 cs.SE · cs.AI

Recognition: no theorem link

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 18:46 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords SWE-agent evaluation · Lucky Pass · process-level assessment · trajectory quality · software engineering agents · Prefix Tree Acceptor · SWE-bench · agent benchmarking

The pith

Binary pass rates in SWE-agent tests equate chaotic trial-and-error successes with systematic ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that judging software engineering agents only by whether their final patch passes tests treats lucky, disordered runs the same as principled ones. Analysis of 1,815 trajectories across 47 tasks found 10.7 percent of the passing ones contain regression cycles, blind retries, missing verification, or disordered sequencing of exploration and implementation. AgentLens addresses this by building task-level reference models from multiple successful runs and labeling each action by its intent in context. The resulting quality scores separate trajectories into Lucky, Solid, and Ideal tiers and expose wide differences in lucky rates across eight model backends. Rankings by quality instead of pass rate shift some models by as many as five positions.

Core claim

Among passing trajectories in the 1,815-trajectory subset, 10.7% exhibit Lucky Pass behavior consisting of regression cycles, blind retries, missing verification, or temporally disordered exploration, implementation, and verification. AgentLens constructs Prefix Tree Acceptor references by merging multiple passing solutions for each task and applies a context-sensitive labeler that assigns actions to Exploration, Implementation, Verification, or Orchestration based on trajectory history.
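The context-sensitive part of that labeler can be illustrated with a small sketch. The stage names come from the paper; the tool names and the specific rules here are hypothetical stand-ins, not the paper's actual seven-rule cascade:

```python
# Illustrative sketch of a context-sensitive intent labeler.
# Tool names and rules are hypothetical; only the four stage
# names (Exploration/Implementation/Verification/Orchestration)
# come from the paper.

FIXED_STAGE = {              # tool type alone decides the stage
    "grep": "Exploration",
    "open_file": "Exploration",
    "edit_file": "Implementation",
    "submit": "Orchestration",
}

def label_action(tool, history):
    """Label one action given the tools used earlier in the trajectory."""
    if tool in FIXED_STAGE:
        return FIXED_STAGE[tool]
    if tool == "run_tests":
        # Context-sensitive: running tests before any source edit is
        # still exploration; after an edit it verifies that edit.
        return "Verification" if "edit_file" in history else "Exploration"
    return "Orchestration"   # fallback for bookkeeping actions
```

The point of the sketch is the last branch: the same tool receives different labels depending on trajectory history, which is what distinguishes this from labeling by tool identity alone.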

What carries the argument

Prefix Tree Acceptor (PTA) references formed by merging multiple passing trajectories, used to score new runs on quality and detect divergence into lucky mechanisms.
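The merging step can be sketched as ordinary prefix-tree construction over action sequences. This is a deliberate simplification: state equivalence here is exact label match, whereas the paper describes an equivalence engine that handles surface variation between actions.

```python
# Minimal sketch of merging passing trajectories into a prefix-tree
# acceptor (PTA). States merge when their action labels match exactly;
# AgentLens's real equivalence engine is richer than this.

def build_pta(trajectories):
    """Merge action-label sequences into a nested-dict prefix tree."""
    root = {}
    for traj in trajectories:
        node = root
        for action in traj:
            node = node.setdefault(action, {})
        node["<accept>"] = {}  # mark a passing terminal state
    return root

def accepts_prefix(pta, prefix):
    """Check whether a partial trajectory stays on a reference path."""
    node = pta
    for action in prefix:
        if action not in node:
            return False  # divergence from every reference solution
        node = node[action]
    return True
```

Under this sketch, a new run is "on reference" while its prefix matches some merged passing solution, and the first failing lookup is a divergence point of the kind the dataset annotates.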

If this is right

  • Lucky Pass rates range from 0.5% to 23.2% across the eight model backends.
  • Ranking models by quality score instead of pass rate moves some models by up to five positions.
  • Lucky Passes decompose into five recurring mechanisms that can be measured separately.
  • Process-level scores distinguish chaotic successes from solid ones within the same passing set.
  • The framework supplies annotated trajectories and task references for further study of agent processes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If many passes are lucky, real-world deployment of these agents may encounter more failures where test oracles are imperfect.
  • Filtering training data to exclude Lucky trajectories could improve downstream agent reliability.
  • The PTA reference approach might extend to other agent domains where outcome-only evaluation hides process flaws.
  • Re-running the evaluation on the full SWE-bench Verified set could reveal whether the 10.7% rate holds beyond the 47 tasks with sufficient data.

Load-bearing premise

Merging multiple passing solutions into Prefix Tree Acceptor references produces a pure model of principled behavior without incorporating lucky elements from the source trajectories.

What would settle it

Manual review of the released AgentLens-Bench annotations to check whether trajectories scored as Lucky actually display the listed disordered behaviors such as regression cycles or absent verification steps.

Figures

Figures reproduced from arXiv: 2605.12925 by Benjamin Steenhoek, Gaurav Mittal, Pingping Lin, Priyam Sahoo, Shengjie Ma, Xiaomin Li, Yu Hu.

Figure 1
Figure 1. Passing trajectories are not behaviorally homogeneous. Among 1,136 passing trajectories in AgentLens-Bench, AgentLens classifies 229 as Ideal (20.2%), 785 as Solid (69.1%), and 122 as Lucky (10.7%). Binary evaluation treats all of these trajectories as equally successful, while process-aware scoring separates direct, coherent solutions from weak processes that happen to pass.
Figure 2
Figure 2. AgentLens workflow. AgentLens starts from raw execution traces and converts them into structured state sequences. Intent-stage labeling assigns each state a process role, such as exploration, implementation, testing, or cleanup. These intent-labeled states become the states used for per-trace PTA construction. Passing trajectories for the same task are then merged into a task-level PTA.
Figure 3
Figure 3. PTA construction. Two individual passing traces sharing an early exploration prefix but diverging at implementation are merged into a single DAG. Shared nodes reflect equivalent actions across agents; branches reflect genuine strategic divergence. During construction, states from different trajectories are merged when they represent equivalent actions, with an equivalence engine handling surface variation.
Figure 4
Figure 4. Intent-stage classification decision tree. Each agent action is classified into one of four intent stages (Exploration, Implementation, Verification, Orchestration) via a priority cascade of seven rules. Rules 1 to 4 (gray diamonds) assign fixed stages based on tool type. Rules 5 to 7 (yellow diamonds) are context-sensitive: the same tool can map to different stages depending on trajectory history.
Figure 5
Figure 5. Scoring signal examples. Each panel contrasts a principled trajectory (left) against a chaotic trajectory (right) on one of the four scoring dimensions: (a) coherence, (b) structural alignment, (c) set coverage, and (d) temporal profile divergence. Together, the four signals capture complementary aspects of process quality.
Figure 6
Figure 6. Failure-mode gallery. Six annotated stage-colored timelines: regression loop, blind-retry cluster, temporal disorder, E/V confusion before and after the context-sensitive fix, unnecessary exploration, and a cyclic pattern.
Figure 7
Figure 7. Score Density by Outcome. Overlapping density histograms of quality scores for Pass (n=1,136) and Fail (n=679) instances. The vertical dashed line marks the empirically chosen threshold at 46.4. Pass instances concentrate in the upper range while fail instances skew left, confirming the score's ability to separate outcomes.
Figure 8
Figure 8. Pass Rate vs. Mean Quality Score. Each point represents one of eight LLM coding agents evaluated on AgentLens-Bench. Dot color encodes rank divergence, defined as |PR − QS| where PR is the pass-rate rank and QS is the quality-score rank: consistent (≤ 1), moderate (2–3), divergent (≥ 4). Arrows indicate whether a model's quality rank improves (▲) or drops (▼) relative to its pass-rate rank.
Figure 10
Figure 10. Lucky Pass category distribution. The 122 Lucky Passes decompose into five categories: C1 Minimal & Unverified (19, 15.6%), C2 Brute-Force Convergence (42, 34.4%), C3 Incomplete Implementation (41, 33.6%), C4 Excessive Exploration (5, 4.1%), and C5 Divergent-but-Valid (15, 12.3%). Categories C2 and C3 together account for 68% of all Lucky Passes.
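The rank-divergence measure reduces to comparing two rankings of the same models. A minimal sketch, with the model set and both ranking functions as assumptions of this illustration:

```python
# Sketch of rank divergence |PR - QS|: PR is the pass-rate rank and
# QS is the quality-score rank, with rank 1 = best. Any model names
# and scores fed in are illustrative, not the paper's data.

def ranks(scores):
    """Map each model to its 1-based rank, highest score first."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(order)}

def rank_divergence(pass_rates, quality_scores):
    """Per-model |pass-rate rank - quality-score rank|."""
    pr, qs = ranks(pass_rates), ranks(quality_scores)
    return {m: abs(pr[m] - qs[m]) for m in pass_rates}
```

Under this sketch, a model that is first by pass rate but third by quality score gets divergence 2, the "moderate" band described above.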
Original abstract

Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome-only view treats a principled solution and a chaotic trial-and-error process as equivalent. We show that this equivalence is empirically false. We evaluate 2,614 OpenHands trajectories from eight model backends on 60 SWE-bench Verified tasks. Of these, 47 have enough passing trajectories to construct task-level process references, yielding a 1,815-trajectory evaluation subset. Among passing trajectories in this subset, 10.7% exhibit behavior we call a Lucky Pass: regression cycles, blind retries, missing verification, or temporally disordered exploration, implementation, and verification. We introduce AgentLens, a framework for process-level assessment of SWE-agent trajectories, and release AgentLens-Bench, a dataset of 1,815 trajectories annotated with quality scores, waste signals, divergence points, and 47 task-level Prefix Tree Acceptor (PTA) references. AgentLens builds PTA references by merging multiple passing solutions for the same task, and uses a context-sensitive intent labeler to assign actions to Exploration, Implementation, Verification, or Orchestration based on trajectory history rather than tool identity alone. On AgentLens-Bench, the quality score separates passing trajectories into Lucky, Solid, and Ideal tiers and further decomposes Lucky Passes into five recurring mechanisms. Across the eight model backends, Lucky rates range from 0.5% to 23.2%, and some models move by as many as five rank positions when ranked by quality score instead of pass rate. We release the anonymized project repository, including the AgentLens-Bench dataset and AgentLens SDK, at https://github.com/microsoft/code-agent-state-trajectories/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that binary pass/fail evaluation of SWE agents is insufficient, as many passing trajectories involve inefficient 'Lucky Pass' processes (regression cycles, blind retries, missing verification, or temporally disordered exploration/implementation/verification). Using AgentLens on 1,815 trajectories from 2,614 total across eight models and 60 SWE-bench Verified tasks (47 tasks with sufficient passes), they construct task-level Prefix Tree Acceptor (PTA) references by merging passing solutions, apply a context-sensitive intent labeler to categorize actions, and derive quality scores that partition trajectories into Lucky/Solid/Ideal tiers. This yields a 10.7% lucky pass rate, model-specific lucky rates from 0.5% to 23.2%, and rank shifts of up to five positions when using quality scores versus pass rates. The work releases AgentLens-Bench (annotated trajectories and 47 PTA references) and an SDK.

Significance. If the central measurements are robust, the result is significant because it demonstrates that outcome-only metrics can mask substantial process differences in agent behavior, with direct implications for benchmark design and model ranking in software engineering agents. The open release of the annotated dataset, PTA references, and SDK is a clear strength for reproducibility and enables follow-on work on process-aware evaluation. The empirical rank-shift evidence provides a falsifiable hook for the community to test whether quality tiers predict downstream properties such as maintainability or generalization.

major comments (2)
  1. [§3] §3 (PTA reference construction): The PTA references are formed by merging all passing trajectories per task without an explicit quality filter or iterative removal of trajectories later classified as lucky passes. Because the source set contains the same regression cycles and disordered sequences that the framework later flags, the acceptor may encode non-principled paths as canonical; this directly affects the validity of the 10.7% lucky-pass statistic and the quality-tier separation. A sensitivity analysis that rebuilds PTAs after removing candidate lucky trajectories, or a description of any ad-hoc filtering, is required to establish that the reference truly captures principled behavior.
  2. [§4.2] §4.2 and Table 2 (quality-score derivation and five-mechanism decomposition): The quality tiers and the breakdown of the 10.7% into five recurring mechanisms rest on the context-sensitive intent labeler and subsequent annotation. The manuscript must report inter-annotator agreement (e.g., Cohen’s κ or percentage agreement) for the Exploration/Implementation/Verification/Orchestration labels and for the lucky-pass mechanism tags; without these metrics it is impossible to assess whether the reported percentages and rank shifts are stable under annotation noise.
minor comments (2)
  1. [§2] Abstract and §2: The total of 2,614 trajectories is reduced to the 1,815-trajectory subset for the 47 tasks that have enough passes; the exact selection rule (minimum number of passes per task, any other filters) should be stated explicitly in the main text rather than only in the abstract.
  2. [Figure 3] Figure 3 (rank-shift visualization): Ensure that the legend clearly distinguishes pass-rate ranking from quality-score ranking and that error bars or confidence intervals are shown for the per-model lucky rates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We believe the concerns raised can be addressed through additional analysis and reporting in the revised manuscript, strengthening the robustness of our claims.

Point-by-point responses
  1. Referee: [§3] §3 (PTA reference construction): The PTA references are formed by merging all passing trajectories per task without an explicit quality filter or iterative removal of trajectories later classified as lucky passes. Because the source set contains the same regression cycles and disordered sequences that the framework later flags, the acceptor may encode non-principled paths as canonical; this directly affects the validity of the 10.7% lucky-pass statistic and the quality-tier separation. A sensitivity analysis that rebuilds PTAs after removing candidate lucky trajectories, or a description of any ad-hoc filtering, is required to establish that the reference truly captures principled behavior.

    Authors: We appreciate this observation regarding the potential circularity in PTA construction. While the merging process in Prefix Tree Acceptors inherently captures the most common paths across trajectories, and lucky passes are defined as significant deviations from this merged structure, we acknowledge that including all passes could introduce some noise. In the revised version, we will conduct a sensitivity analysis by iteratively removing trajectories classified as lucky passes and rebuilding the PTAs to verify stability of the quality scores and lucky-pass rates. This will be added to §3. revision: yes

  2. Referee: [§4.2] §4.2 and Table 2 (quality-score derivation and five-mechanism decomposition): The quality tiers and the breakdown of the 10.7% into five recurring mechanisms rest on the context-sensitive intent labeler and subsequent annotation. The manuscript must report inter-annotator agreement (e.g., Cohen’s κ or percentage agreement) for the Exploration/Implementation/Verification/Orchestration labels and for the lucky-pass mechanism tags; without these metrics it is impossible to assess whether the reported percentages and rank shifts are stable under annotation noise.

    Authors: We agree that reporting inter-annotator agreement is essential for validating the annotation process. The context-sensitive intent labeler combines automated rules with human verification for ambiguous cases. In the revision, we will include Cohen’s κ scores for both the intent labels (Exploration/Implementation/Verification/Orchestration) and the lucky-pass mechanism tags, based on a double-annotation of a subset of trajectories. This addition will appear in §4.2. revision: yes
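The agreement statistic the referee asks for is standard; Cohen's κ is the chance-corrected agreement (p_o − p_e)/(1 − p_e). A self-contained sketch for two annotators over any label set (not the authors' protocol):

```python
# Minimal Cohen's kappa for two annotators labeling the same items.
# Illustrative only; undefined when chance agreement p_e equals 1.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    # observed agreement: fraction of items where both annotators agree
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement under independent labeling with each
    # annotator's marginal label frequencies
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Extending this to the seven annotators mentioned in the paper's Figure 6 caption would call for Fleiss' κ rather than pairwise Cohen's κ.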

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent annotations and merged references

Full rationale

The paper constructs task-level PTA references by merging multiple passing trajectories and applies a context-sensitive intent labeler plus quality annotations to identify Lucky Passes (regression cycles, blind retries, missing verification, disordered sequences) at 10.7%. Quality scores and tiering into Lucky/Solid/Ideal are built from these annotations and references rather than from the binary pass-rate signal or any fitted parameter. No self-citation load-bearing steps, uniqueness theorems imported from authors, or ansatzes smuggled via prior work appear in the derivation chain. The central empirical claims rest on the released AgentLens-Bench dataset and explicit behavioral signals, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that merged passing trajectories form a reliable reference for principled behavior and that context-sensitive labeling accurately captures intent independent of tool identity.

free parameters (1)
  • quality tier thresholds
    Cutoffs separating Lucky, Solid, and Ideal trajectories are not specified in the abstract.
axioms (1)
  • domain assumption: Multiple passing trajectories for the same task can be merged into a Prefix Tree Acceptor that represents ideal solution structure.
    Invoked when constructing the 47 task-level references used for quality scoring.
invented entities (1)
  • Lucky Pass (no independent evidence)
    purpose: Category for passing trajectories that exhibit regression cycles, blind retries, or disordered process steps.
    Defined from observed behaviors in the 1,815-trajectory subset.

pith-pipeline@v0.9.0 · 5641 in / 1323 out tokens · 70456 ms · 2026-05-14T18:46:32.801828+00:00 · methodology

discussion (0)

