Pith · machine review for the scientific record

arxiv: 2605.12925 · v1 · submitted 2026-05-13 · 💻 cs.SE · cs.AI

Recognition: no theorem link

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 18:46 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords SWE-agent evaluation · Lucky Pass · process-level assessment · trajectory quality · software engineering agents · Prefix Tree Acceptor · SWE-bench · agent benchmarking

The pith

Binary pass rates in SWE-agent tests equate chaotic trial-and-error successes with systematic ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that judging software engineering agents only by whether their final patch passes tests treats lucky, disordered runs the same as principled ones. Analysis of 1,815 trajectories across 47 tasks found 10.7 percent of the passing ones contain regression cycles, blind retries, missing verification, or disordered sequencing of exploration and implementation. AgentLens addresses this by building task-level reference models from multiple successful runs and labeling each action by its intent in context. The resulting quality scores separate trajectories into Lucky, Solid, and Ideal tiers and expose wide differences in lucky rates across eight model backends. Rankings by quality instead of pass rate shift some models by as many as five positions.

Core claim

Among passing trajectories in the 1,815-trajectory subset, 10.7% exhibit Lucky Pass behavior consisting of regression cycles, blind retries, missing verification, or temporally disordered exploration, implementation, and verification. AgentLens constructs Prefix Tree Acceptor references by merging multiple passing solutions for each task and applies a context-sensitive labeler that assigns actions to Exploration, Implementation, Verification, or Orchestration based on trajectory history.
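The context-sensitive part of that labeler can be illustrated with a small sketch. The stage names come from the paper; the tool names and the specific rules here are hypothetical stand-ins, not the paper's actual seven-rule cascade:

```python
# Illustrative sketch of a context-sensitive intent labeler.
# Tool names and rules are hypothetical; only the four stage
# names (Exploration/Implementation/Verification/Orchestration)
# come from the paper.

FIXED_STAGE = {              # tool type alone decides the stage
    "grep": "Exploration",
    "open_file": "Exploration",
    "edit_file": "Implementation",
    "submit": "Orchestration",
}

def label_action(tool, history):
    """Label one action given the tools used earlier in the trajectory."""
    if tool in FIXED_STAGE:
        return FIXED_STAGE[tool]
    if tool == "run_tests":
        # Context-sensitive: running tests before any source edit is
        # still exploration; after an edit it verifies that edit.
        return "Verification" if "edit_file" in history else "Exploration"
    return "Orchestration"   # fallback for bookkeeping actions
```

The point of the sketch is the last branch: the same tool receives different labels depending on trajectory history, which is what distinguishes this from labeling by tool identity alone.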

What carries the argument

Prefix Tree Acceptor (PTA) references formed by merging multiple passing trajectories, used to score new runs on quality and detect divergence into lucky mechanisms.
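The merging step can be sketched as ordinary prefix-tree construction over action sequences. This is a deliberate simplification: state equivalence here is exact label match, whereas the paper describes an equivalence engine that handles surface variation between actions.

```python
# Minimal sketch of merging passing trajectories into a prefix-tree
# acceptor (PTA). States merge when their action labels match exactly;
# AgentLens's real equivalence engine is richer than this.

def build_pta(trajectories):
    """Merge action-label sequences into a nested-dict prefix tree."""
    root = {}
    for traj in trajectories:
        node = root
        for action in traj:
            node = node.setdefault(action, {})
        node["<accept>"] = {}  # mark a passing terminal state
    return root

def accepts_prefix(pta, prefix):
    """Check whether a partial trajectory stays on a reference path."""
    node = pta
    for action in prefix:
        if action not in node:
            return False  # divergence from every reference solution
        node = node[action]
    return True
```

Under this sketch, a new run is "on reference" while its prefix matches some merged passing solution, and the first failing lookup is a divergence point of the kind the dataset annotates.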

If this is right

  • Lucky Pass rates range from 0.5% to 23.2% across the eight model backends.
  • Ranking models by quality score instead of pass rate moves some models by up to five positions.
  • Lucky Passes decompose into five recurring mechanisms that can be measured separately.
  • Process-level scores distinguish chaotic successes from solid ones within the same passing set.
  • The framework supplies annotated trajectories and task references for further study of agent processes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If many passes are lucky, real-world deployment of these agents may encounter more failures where test oracles are imperfect.
  • Filtering training data to exclude Lucky trajectories could improve downstream agent reliability.
  • The PTA reference approach might extend to other agent domains where outcome-only evaluation hides process flaws.
  • Re-running the evaluation on the full SWE-bench Verified set could reveal whether the 10.7% rate holds beyond the 47 tasks with sufficient data.

Load-bearing premise

Merging multiple passing solutions into Prefix Tree Acceptor references produces a pure model of principled behavior without incorporating lucky elements from the source trajectories.

What would settle it

Manual review of the released AgentLens-Bench annotations to check whether trajectories scored as Lucky actually display the listed disordered behaviors such as regression cycles or absent verification steps.

Figures

Figures reproduced from arXiv: 2605.12925 by Benjamin Steenhoek, Gaurav Mittal, Pingping Lin, Priyam Sahoo, Shengjie Ma, Xiaomin Li, Yu Hu.

Figure 1
Figure 1. Passing trajectories are not behaviorally homogeneous. Among 1,136 passing trajectories in AgentLens-Bench, AgentLens classifies 229 as Ideal (20.2%), 785 as Solid (69.1%), and 122 as Lucky (10.7%). Binary evaluation treats all of these trajectories as equally successful, while process-aware scoring separates direct, coherent solutions from weak processes that happen to pass.
Figure 2
Figure 2. AgentLens workflow. AgentLens starts from raw execution traces and converts them into structured state sequences. Intent-stage labeling assigns each state a process role, such as exploration, implementation, testing, or cleanup. These intent-labeled states become the states used for per-trace PTA construction. Passing trajectories for the same task are then merged into a task-level PTA.
Figure 3
Figure 3. PTA construction. Two individual passing traces sharing an early exploration prefix but diverging at implementation are merged into a single DAG. Shared nodes reflect equivalent actions across agents; branches reflect genuine strategic divergence. During construction, states from different trajectories are merged when they represent equivalent actions, with an equivalence engine handling surface variation.
Figure 4
Figure 4. Intent-stage classification decision tree. Each agent action is classified into one of four intent stages (Exploration, Implementation, Verification, Orchestration) via a priority cascade of seven rules. Rules 1 to 4 (gray diamonds) assign fixed stages based on tool type. Rules 5 to 7 (yellow diamonds) are context-sensitive: the same tool can map to different stages depending on trajectory history.
Figure 5
Figure 5. Scoring signal examples. Each panel contrasts a principled trajectory (left) against a chaotic trajectory (right) on one of the four scoring dimensions: (a) coherence, (b) structural alignment, (c) set coverage, and (d) temporal profile divergence. Together, the four signals capture complementary aspects of process quality.
Figure 6
Figure 6. Failure-mode gallery. Six annotated stage-colored timelines: regression loop, blind-retry cluster, temporal disorder, E/V confusion before and after the context-sensitive fix, unnecessary exploration, and a cyclic pattern.
Figure 7
Figure 7. Score Density by Outcome. Overlapping density histograms of quality scores for Pass (n=1,136) and Fail (n=679) instances. The vertical dashed line marks the empirically chosen threshold at 46.4. Pass instances concentrate in the upper range while fail instances skew left, confirming the score's ability to separate outcomes.
Figure 8
Figure 8. Pass Rate vs. Mean Quality Score. Each point represents one of eight LLM coding agents evaluated on AgentLens-Bench. Dot color encodes rank divergence, defined as |PR − QS| where PR is the pass-rate rank and QS is the quality-score rank: consistent (≤ 1), moderate (2–3), divergent (≥ 4). Arrows indicate whether a model's quality rank improves (▲) or drops (▼) relative to its pass-rate rank.
Figure 10
Figure 10. Lucky Pass category distribution. The 122 Lucky Passes decompose into five categories: C1 Minimal & Unverified (19, 15.6%), C2 Brute-Force Convergence (42, 34.4%), C3 Incomplete Implementation (41, 33.6%), C4 Excessive Exploration (5, 4.1%), and C5 Divergent-but-Valid (15, 12.3%). Categories C2 and C3 together account for 68% of all Lucky Passes.
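The rank-divergence measure reduces to comparing two rankings of the same models. A minimal sketch, with the model set and both ranking functions as assumptions of this illustration:

```python
# Sketch of rank divergence |PR - QS|: PR is the pass-rate rank and
# QS is the quality-score rank, with rank 1 = best. Any model names
# and scores fed in are illustrative, not the paper's data.

def ranks(scores):
    """Map each model to its 1-based rank, highest score first."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(order)}

def rank_divergence(pass_rates, quality_scores):
    """Per-model |pass-rate rank - quality-score rank|."""
    pr, qs = ranks(pass_rates), ranks(quality_scores)
    return {m: abs(pr[m] - qs[m]) for m in pass_rates}
```

Under this sketch, a model that is first by pass rate but third by quality score gets divergence 2, the "moderate" band described above.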
Original abstract

Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome-only view treats a principled solution and a chaotic trial-and-error process as equivalent. We show that this equivalence is empirically false. We evaluate 2,614 OpenHands trajectories from eight model backends on 60 SWE-bench Verified tasks. Of these, 47 have enough passing trajectories to construct task-level process references, yielding a 1,815-trajectory evaluation subset. Among passing trajectories in this subset, 10.7% exhibit behavior we call a Lucky Pass: regression cycles, blind retries, missing verification, or temporally disordered exploration, implementation, and verification. We introduce AgentLens, a framework for process-level assessment of SWE-agent trajectories, and release AgentLens-Bench, a dataset of 1,815 trajectories annotated with quality scores, waste signals, divergence points, and 47 task-level Prefix Tree Acceptor (PTA) references. AgentLens builds PTA references by merging multiple passing solutions for the same task, and uses a context-sensitive intent labeler to assign actions to Exploration, Implementation, Verification, or Orchestration based on trajectory history rather than tool identity alone. On AgentLens-Bench, the quality score separates passing trajectories into Lucky, Solid, and Ideal tiers and further decomposes Lucky Passes into five recurring mechanisms. Across the eight model backends, Lucky rates range from 0.5% to 23.2%, and some models move by as many as five rank positions when ranked by quality score instead of pass rate. We release the anonymized project repository, including the AgentLens-Bench dataset and AgentLens SDK, at https://github.com/microsoft/code-agent-state-trajectories/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that binary pass/fail evaluation of SWE agents is insufficient, as many passing trajectories involve inefficient 'Lucky Pass' processes (regression cycles, blind retries, missing verification, or temporally disordered exploration/implementation/verification). Using AgentLens on 1,815 trajectories from 2,614 total across eight models and 60 SWE-bench Verified tasks (47 tasks with sufficient passes), they construct task-level Prefix Tree Acceptor (PTA) references by merging passing solutions, apply a context-sensitive intent labeler to categorize actions, and derive quality scores that partition trajectories into Lucky/Solid/Ideal tiers. This yields a 10.7% lucky pass rate, model-specific lucky rates from 0.5% to 23.2%, and rank shifts of up to five positions when using quality scores versus pass rates. The work releases AgentLens-Bench (annotated trajectories and 47 PTA references) and an SDK.

Significance. If the central measurements are robust, the result is significant because it demonstrates that outcome-only metrics can mask substantial process differences in agent behavior, with direct implications for benchmark design and model ranking in software engineering agents. The open release of the annotated dataset, PTA references, and SDK is a clear strength for reproducibility and enables follow-on work on process-aware evaluation. The empirical rank-shift evidence provides a falsifiable hook for the community to test whether quality tiers predict downstream properties such as maintainability or generalization.

major comments (2)
  1. [§3] §3 (PTA reference construction): The PTA references are formed by merging all passing trajectories per task without an explicit quality filter or iterative removal of trajectories later classified as lucky passes. Because the source set contains the same regression cycles and disordered sequences that the framework later flags, the acceptor may encode non-principled paths as canonical; this directly affects the validity of the 10.7% lucky-pass statistic and the quality-tier separation. A sensitivity analysis that rebuilds PTAs after removing candidate lucky trajectories, or a description of any ad-hoc filtering, is required to establish that the reference truly captures principled behavior.
  2. [§4.2] §4.2 and Table 2 (quality-score derivation and five-mechanism decomposition): The quality tiers and the breakdown of the 10.7% into five recurring mechanisms rest on the context-sensitive intent labeler and subsequent annotation. The manuscript must report inter-annotator agreement (e.g., Cohen’s κ or percentage agreement) for the Exploration/Implementation/Verification/Orchestration labels and for the lucky-pass mechanism tags; without these metrics it is impossible to assess whether the reported percentages and rank shifts are stable under annotation noise.
minor comments (2)
  1. [§2] Abstract and §2: The total of 2,614 trajectories is reduced to the 1,815-trajectory subset for the 47 tasks that have enough passes; the exact selection rule (minimum number of passes per task, any other filters) should be stated explicitly in the main text rather than only in the abstract.
  2. [Figure 3] Figure 3 (rank-shift visualization): Ensure that the legend clearly distinguishes pass-rate ranking from quality-score ranking and that error bars or confidence intervals are shown for the per-model lucky rates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We believe the concerns raised can be addressed through additional analysis and reporting in the revised manuscript, strengthening the robustness of our claims.

Point-by-point responses
  1. Referee: [§3] §3 (PTA reference construction): The PTA references are formed by merging all passing trajectories per task without an explicit quality filter or iterative removal of trajectories later classified as lucky passes. Because the source set contains the same regression cycles and disordered sequences that the framework later flags, the acceptor may encode non-principled paths as canonical; this directly affects the validity of the 10.7% lucky-pass statistic and the quality-tier separation. A sensitivity analysis that rebuilds PTAs after removing candidate lucky trajectories, or a description of any ad-hoc filtering, is required to establish that the reference truly captures principled behavior.

    Authors: We appreciate this observation regarding the potential circularity in PTA construction. While the merging process in Prefix Tree Acceptors inherently captures the most common paths across trajectories, and lucky passes are defined as significant deviations from this merged structure, we acknowledge that including all passes could introduce some noise. In the revised version, we will conduct a sensitivity analysis by iteratively removing trajectories classified as lucky passes and rebuilding the PTAs to verify stability of the quality scores and lucky-pass rates. This will be added to §3. revision: yes

  2. Referee: [§4.2] §4.2 and Table 2 (quality-score derivation and five-mechanism decomposition): The quality tiers and the breakdown of the 10.7% into five recurring mechanisms rest on the context-sensitive intent labeler and subsequent annotation. The manuscript must report inter-annotator agreement (e.g., Cohen’s κ or percentage agreement) for the Exploration/Implementation/Verification/Orchestration labels and for the lucky-pass mechanism tags; without these metrics it is impossible to assess whether the reported percentages and rank shifts are stable under annotation noise.

    Authors: We agree that reporting inter-annotator agreement is essential for validating the annotation process. The context-sensitive intent labeler combines automated rules with human verification for ambiguous cases. In the revision, we will include Cohen’s κ scores for both the intent labels (Exploration/Implementation/Verification/Orchestration) and the lucky-pass mechanism tags, based on a double-annotation of a subset of trajectories. This addition will appear in §4.2. revision: yes
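The agreement statistic the referee asks for is standard; Cohen's κ is the chance-corrected agreement (p_o − p_e)/(1 − p_e). A self-contained sketch for two annotators over any label set (not the authors' protocol):

```python
# Minimal Cohen's kappa for two annotators labeling the same items.
# Illustrative only; undefined when chance agreement p_e equals 1.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    # observed agreement: fraction of items where both annotators agree
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement under independent labeling with each
    # annotator's marginal label frequencies
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Extending this to the seven annotators mentioned in the paper's Figure 6 caption would call for Fleiss' κ rather than pairwise Cohen's κ.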

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent annotations and merged references

Full rationale

The paper constructs task-level PTA references by merging multiple passing trajectories and applies a context-sensitive intent labeler plus quality annotations to identify Lucky Passes (regression cycles, blind retries, missing verification, disordered sequences) at 10.7%. Quality scores and tiering into Lucky/Solid/Ideal are built from these annotations and references rather than from the binary pass-rate signal or any fitted parameter. No self-citation load-bearing steps, uniqueness theorems imported from authors, or ansatzes smuggled via prior work appear in the derivation chain. The central empirical claims rest on the released AgentLens-Bench dataset and explicit behavioral signals, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that merged passing trajectories form a reliable reference for principled behavior and that context-sensitive labeling accurately captures intent independent of tool identity.

free parameters (1)
  • quality tier thresholds
    Cutoffs separating Lucky, Solid, and Ideal trajectories are not specified in the abstract.
axioms (1)
  • domain assumption: Multiple passing trajectories for the same task can be merged into a Prefix Tree Acceptor that represents ideal solution structure.
    Invoked when constructing the 47 task-level references used for quality scoring.
invented entities (1)
  • Lucky Pass (no independent evidence)
    purpose: Category for passing trajectories that exhibit regression cycles, blind retries, or disordered process steps.
    Defined from observed behaviors in the 1,815-trajectory subset.

pith-pipeline@v0.9.0 · 5641 in / 1323 out tokens · 70456 ms · 2026-05-14T18:46:32.801828+00:00 · methodology

discussion (0)

