Recognition: no theorem link
SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution
Pith reviewed 2026-05-12 02:24 UTC · model grok-4.3
The pith
SWE Atlas supplies a benchmark suite that scores coding agents on engineering quality as well as functional correctness across codebase questions, test writing, and refactoring.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SWE Atlas consists of 284 tasks divided into Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). Evaluation combines programmatic verification with category-specific rubrics that assess software-engineering quality beyond mere correctness. Frontier models such as GPT-5.4 and Opus 4.7 achieve the highest aggregate scores, yet even the best open-weight models score poorly, and all tested models consistently falter on subtle edge cases, complex runtime analysis, and adherence to maintainability practices. The analysis indicates that successful performance depends on extensive codebase exploration and runtime-driven reasoning.
What carries the argument
Under-specified, agentic task formulations paired with a dual evaluation framework that merges programmatic checks and rubric-based scoring of engineering quality attributes.
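The dual framework can be pictured as a two-stage scorer: programmatic checks gate functional correctness, and rubric items grade engineering quality on top of it. A minimal sketch, assuming an illustrative gating rule and aggregation (these names and the formula are assumptions, not the paper's actual protocol); the negative-rubric semantics follow the paper's convention that "YES" means the behavior is present, so a negative rubric passes when the behavior is absent:

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    statement: str   # e.g. "Extracts the local buffer into a shared package"
    negative: bool   # True if the desired outcome is the ABSENCE of the behavior
    present: bool    # judge's verdict: is the behavior present in the agent's diff?

    def passed(self) -> bool:
        # A positive rubric passes when present; a negative one when absent.
        return self.present != self.negative

def score_task(programmatic_checks: list[bool], rubric: list[RubricItem]) -> float:
    """Hypothetical aggregate: all programmatic checks must pass, then
    the score is the fraction of rubric criteria satisfied."""
    if not all(programmatic_checks):
        return 0.0          # fails functional correctness outright
    if not rubric:
        return 1.0
    return sum(item.passed() for item in rubric) / len(rubric)

# Example: functionally correct solution meeting 2 of 3 quality criteria.
rubric = [
    RubricItem("Removes the duplicated helper", negative=False, present=True),
    RubricItem("Introduces new `unsafe` code", negative=True, present=False),
    RubricItem("Reuses the existing test fixture", negative=False, present=False),
]
print(score_task([True, True], rubric))  # → 0.6666666666666666
```

The gate makes quality scores meaningful only for correct solutions; a weighted blend would be an equally plausible design.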
If this is right
- Models must improve their handling of edge cases and runtime reasoning to reach high scores on the new tasks.
- Open-weight models will require substantial gains before they match frontier performance on quality-oriented engineering work.
- Agents that succeed rely on broad codebase exploration rather than narrow local fixes.
- Future agent designs need explicit mechanisms for producing maintainable and reusable code, not only correct code.
- Evaluations limited to issue resolution will underestimate gaps in practical coding capability.
Where Pith is reading between the lines
- Training objectives for coding agents could incorporate rubric-style rewards for abstraction reuse and test coverage to close the observed quality gaps.
- Combining SWE Atlas scores with existing issue-resolution benchmarks would give a fuller picture of where agents still fall short.
- The rubric dimensions might serve as diagnostic tools for prompt engineering or fine-tuning focused on software hygiene.
- If widely adopted, the benchmark could shift industry focus from raw pass rates toward measurable engineering standards in agent outputs.
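The training-objective idea in the first bullet can be made concrete as a scalar reward blending a pass/fail correctness signal with weighted rubric dimensions. A hedged sketch; the dimension names, weights, and mixing fraction are illustrative assumptions, not anything proposed in the paper:

```python
def rubric_reward(tests_pass: bool, rubric_scores: dict[str, float],
                  weights: dict[str, float], quality_mix: float = 0.3) -> float:
    """Blend functional correctness with rubric-graded quality, in [0, 1].

    rubric_scores: per-dimension scores in [0, 1], e.g. from a rubric judge.
    quality_mix:   fraction of the reward allocated to quality dimensions.
    """
    total_w = sum(weights.values())
    quality = sum(weights[k] * rubric_scores.get(k, 0.0) for k in weights) / total_w
    return (1.0 - quality_mix) * float(tests_pass) + quality_mix * quality

# Illustrative dimensions echoing the bullet: abstraction reuse and test coverage.
scores = {"abstraction_reuse": 0.5, "test_coverage": 1.0}
weights = {"abstraction_reuse": 1.0, "test_coverage": 1.0}
print(rubric_reward(True, scores, weights))  # 0.7 * 1.0 + 0.3 * 0.75 = 0.925
```

A reward of this shape never rewards quality more than correctness unless `quality_mix` exceeds 0.5, which keeps the optimizer from gaming rubrics on broken solutions.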
Load-bearing premise
The chosen tasks, their open-ended formulations, and the rubric criteria faithfully represent the range and quality expectations of actual professional software engineering work.
What would settle it
A controlled study in which teams using high-scoring versus low-scoring agents produce measurably different levels of maintainable, reusable code in real multi-week projects, or a survey in which practicing engineers rate the benchmark tasks as unrepresentative of daily workflows.

read the original abstract
We introduce SWE Atlas, a benchmark suite for coding agents spanning three professional software engineering workflows: Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). SWE Atlas differs from prior SWE benchmarks in three key ways: it targets underrepresented but practically important task categories, uses comprehensive category-specific evaluation protocols, and adopts under-specified, agentic task formulations that better reflect real-world usage. Its evaluation framework combines programmatic checks with rubric-based assessment. This goes beyond functional correctness, evaluating software engineering quality, including test and refactor completeness, maintainability, reusable abstractions, and codebase hygiene. We evaluate a range of frontier and open-weight models on SWE Atlas and find that GPT-5.4 and Opus 4.7 achieve the strongest overall performance, while even the best open-weight models score poorly. Our analysis suggests that top models rely on extensive codebase exploration and runtime-driven reasoning. However, even top models consistently struggle with subtle edge cases, complex runtime analysis, and adherence to software engineering best practices. Overall, SWE Atlas provides a complementary evaluation suite for measuring both correctness and engineering quality in coding agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SWE Atlas, a benchmark suite for coding agents with 124 Codebase Q&A tasks, 90 Test Writing tasks, and 70 Refactoring tasks. It differs from prior SWE benchmarks by targeting underrepresented workflows, using under-specified agentic formulations, and combining programmatic checks with category-specific rubrics to assess not only functional correctness but also engineering quality aspects such as test completeness, maintainability, reusable abstractions, and codebase hygiene. Evaluations of frontier and open-weight models show GPT-5.4 and Opus 4.7 achieving the strongest performance, while even top models struggle with edge cases, runtime analysis, and best practices; the work concludes that SWE Atlas offers a complementary evaluation suite.
Significance. If the rubric-based measures of engineering quality prove reliable and distinct from correctness, SWE Atlas would address a genuine gap in existing benchmarks by evaluating practically relevant workflows and quality dimensions, potentially informing more robust coding agent development. The empirical results on model performance and failure modes provide useful data points, though their interpretability hinges on the validity of the evaluation protocols.
major comments (3)
- [§3 and §4] §3 (Benchmark Design) and §4 (Evaluation Framework): The central claim that rubrics measure a distinct 'engineering quality' dimension requires evidence that the criteria (test completeness, maintainability, reusable abstractions, codebase hygiene) are grounded in professional standards. The manuscript describes category-specific rubrics and under-specified tasks but reports no derivation process, no inter-rater reliability statistics, and no correlation with expert judgments or real-world outcomes. This is load-bearing for the complementarity argument, as unvalidated rubrics risk conflating subjective preferences with measurable quality.
- [§5] §5 (Results and Analysis): The claim that top models 'rely on extensive codebase exploration and runtime-driven reasoning' and 'consistently struggle with subtle edge cases' is presented as a key insight, yet the supporting evidence appears to rest on qualitative observations without quantitative breakdowns (e.g., per-task failure rates or ablation on exploration depth). This weakens the analysis of model limitations relative to the benchmark's goals.
- [§3] Task selection (throughout §3): The weakest assumption is that the chosen tasks and under-specified formulations accurately represent professional software engineering workflows. The paper provides no external validation (e.g., expert review of task realism or comparison to industry issue distributions), which is necessary to secure the claim that SWE Atlas better reflects real-world usage than prior benchmarks.
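The reliability evidence the first major comment asks for is cheap to produce once two annotators grade the same rubric items. A minimal stdlib sketch of Cohen's kappa over binary YES/NO rubric verdicts (the sample verdicts are made up for illustration):

```python
def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """Agreement between two raters on binary verdicts, corrected for chance."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a_yes, p_b_yes = sum(a) / n, sum(b) / n
    # Chance agreement under independent raters with these marginal rates.
    expected = p_a_yes * p_b_yes + (1 - p_a_yes) * (1 - p_b_yes)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters constant and identical
    return (observed - expected) / (1 - expected)

rater1 = [True, True, False, True, False, True]
rater2 = [True, True, False, False, False, True]
print(round(cohens_kappa(rater1, rater2), 3))  # → 0.667
```

Reporting per-rubric-dimension kappa (rather than one pooled number) would directly address whether dimensions like maintainability are gradable consistently at all.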
minor comments (2)
- [Abstract and §1] The abstract and introduction would benefit from a concise table summarizing task counts, evaluation metrics, and key differences from prior benchmarks (e.g., SWE-bench) for quick reader orientation.
- [§5] Notation for model names (e.g., 'GPT-5.4', 'Opus 4.7') should be clarified with exact versions or citations to avoid ambiguity in reproducibility.
Simulated Author's Rebuttal
Thank you for your constructive feedback on our paper. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we will revise the manuscript to address the concerns raised.
read point-by-point responses
-
Referee: [§3 and §4] §3 (Benchmark Design) and §4 (Evaluation Framework): The central claim that rubrics measure a distinct 'engineering quality' dimension requires evidence that the criteria (test completeness, maintainability, reusable abstractions, codebase hygiene) are grounded in professional standards. The manuscript describes category-specific rubrics and under-specified tasks but reports no derivation process, no inter-rater reliability statistics, and no correlation with expert judgments or real-world outcomes. This is load-bearing for the complementarity argument, as unvalidated rubrics risk conflating subjective preferences with measurable quality.
Authors: We thank the referee for highlighting this important point. The rubrics were developed by drawing on established software engineering literature and best practices, including principles from 'Clean Code' by Robert Martin and refactoring guidelines from Martin Fowler. However, we agree that explicit documentation of the derivation process and inter-rater reliability would strengthen the claims. In the revised manuscript, we will add a detailed subsection in §3 describing the rubric creation process, including references to professional standards, and report inter-rater agreement statistics based on multiple annotators. We will also discuss potential correlations with real-world outcomes as a direction for future work. This addresses the concern without altering the core evaluation framework. revision: yes
-
Referee: [§5] §5 (Results and Analysis): The claim that top models 'rely on extensive codebase exploration and runtime-driven reasoning' and 'consistently struggle with subtle edge cases' is presented as a key insight, yet the supporting evidence appears to rest on qualitative observations without quantitative breakdowns (e.g., per-task failure rates or ablation on exploration depth). This weakens the analysis of model limitations relative to the benchmark's goals.
Authors: We acknowledge that the current analysis relies partly on qualitative insights from our evaluations. To provide stronger quantitative support, we will include additional analyses in the revised §5, such as per-task failure rate breakdowns by category (e.g., edge cases vs. others) and an ablation study examining the impact of exploration depth on performance. This will make the claims more robust and directly tied to the data. revision: yes
-
Referee: [§3] Task selection (throughout §3): The weakest assumption is that the chosen tasks and under-specified formulations accurately represent professional software engineering workflows. The paper provides no external validation (e.g., expert review of task realism or comparison to industry issue distributions), which is necessary to secure the claim that SWE Atlas better reflects real-world usage than prior benchmarks.
Authors: The task selection was informed by an analysis of common workflows in open-source repositories and professional software engineering practices, focusing on areas underrepresented in existing benchmarks like SWE-bench. While we did not perform a formal external expert review in this version, we will revise §3 to include more details on the selection criteria, such as statistics from GitHub issues and pull requests that motivated the task categories. We will also add a limitations section acknowledging the need for broader validation in future iterations. This maintains the paper's position while addressing the validation concern. revision: partial
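The quantitative breakdown promised in the second response amounts to grouping per-task outcomes by category and failure mode. A stdlib-only sketch over hypothetical evaluation records (the field names and category labels are assumptions, not the paper's actual schema):

```python
from collections import defaultdict

# Hypothetical per-task records: (category, failure_mode), None = task passed.
records = [
    ("test_writing", "edge_case"),
    ("test_writing", None),
    ("refactoring", "maintainability"),
    ("codebase_qa", "runtime_analysis"),
    ("codebase_qa", None),
    ("refactoring", "edge_case"),
]

def failure_rates(records):
    """Per-category rate of each failure mode among all tasks in the category."""
    totals = defaultdict(int)
    fails = defaultdict(lambda: defaultdict(int))
    for category, mode in records:
        totals[category] += 1
        if mode is not None:
            fails[category][mode] += 1
    return {
        cat: {mode: count / totals[cat] for mode, count in modes.items()}
        for cat, modes in fails.items()
    }

for cat, modes in sorted(failure_rates(records).items()):
    print(cat, modes)
```

Tables of this shape, one per model, would turn the qualitative claims about edge cases and runtime analysis into checkable numbers.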
Circularity Check
No circularity: empirical benchmark with no derivations or fitted predictions
full rationale
The paper introduces SWE Atlas as a new benchmark suite across three task categories with category-specific evaluation protocols that combine programmatic checks and rubric-based assessment. No mathematical derivations, equations, fitted parameters, or model predictions appear in the provided text. Claims about model performance (e.g., GPT-5.4 and Opus 4.7 achieving strongest results) are direct empirical observations from running models on the introduced tasks, not reductions of outputs to self-defined inputs by construction. Rubric criteria for engineering quality are presented as part of the benchmark definition rather than derived via any chain that loops back to prior results within the paper. This is a standard empirical benchmark introduction whose central contribution is the task set and evaluation framework itself, with no load-bearing self-citation or ansatz that collapses the argument.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] A. F. Akyürek, A. Gosai, C. B. C. Zhang, V. Gupta, J. Jeong, A. Gunjal, T. Rabbani, M. Mazzone, D. Randolph, M. Mahmoudi Meymand, G. Chattha, P. Rodriguez, D. Mares, P. Singh, M. Liu, S. Chawla, P. Cline, L. Ogaz, E. Hernandez, Z. Wang, P. Bhatter, M. Ayestaran, B. Liu, and Y. He. PRBench: Large-scale expert rubrics for evaluating high-stakes profess...
- [2] Anthropic. Introducing Claude Opus 4.7, Apr. 2026. URL https://www.anthropic.com/news/claude-opus-4-7. Accessed: 2026-05-04.
- [3] R. K. Arora, J. Wei, R. Soskin Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsimpourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, J. Heidecke, and K. Singhal. HealthBench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025.
- [4] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- [5] J. Baumann, V. Padmakumar, X. Li, J. Yang, D. Yang, and S. Koyejo. SWE-chat: Coding agent interactions from real users in the wild, 2026. URL https://arxiv.org/abs/2604.20779.
- [6] F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M.-H. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman, et al. MultiPL-E: A scalable and extensible approach to benchmarking neural code generation. arXiv preprint arXiv:2208.08227, 2022.
- [8] M. Chen. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- [9] X. Deng, J. Da, E. Pan, Y. Y. He, C. Ide, K. Garg, N. Lauffer, A. Park, N. Pasari, C. Rane, et al. SWE-bench Pro: Can AI agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941, 2025.
- [11] D. Gautam, S. Garg, J. Jang, N. Sundaresan, and R. Z. Moghaddam. RefactorBench: Evaluating stateful reasoning in language agents through code. arXiv preprint arXiv:2503.07832, 2025.
- [12] Harbor Framework Team. Harbor: A framework for evaluating and optimizing agents and models in container environments, Jan. 2026. URL https://github.com/harbor-framework/harbor.
- [14] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
- [15] M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026. URL https://arxiv.org/abs/2601.11868.
- [16] N. Mündler, M. Müller, J. He, and M. Vechev. SWT-bench: Testing and validating real-world bug-fixes with code agents. Advances in Neural Information Processing Systems, 37:81857–81887, 2024.
- [17] OpenAI. Why we no longer evaluate SWE-Bench Verified, 2026. URL https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/. Accessed: 2026-05-04.
- [19] M. Raghavendra, A. Gunjal, B. Liu, and Y. He. Agentic rubrics as contextual verifiers for SWE agents, 2026. URL https://arxiv.org/abs/2601.04171.
- [20] C. Research, A. Chan, A. Shalaby, A. Wettig, A. Sanger, A. Zhai, A. Ajay, A. Nair, C. Snell, C. Lu, C. Shen, E. Jia, F. Cassano, H. Liu, H. Chen, H. Wildermuth, J. Jackson, J. Li, J. Katz, J. Yao, J. Hejna, J. Warner, J. Vering, K. Frans, L. Danilek, L. Wright, L. Cen, L. Melas-Kyriazi, M. Truell, M. de Jong, N. Jain, N. Schmidt, N. Wang, N. Muennighof... Composer 2 technical report. arXiv preprint arXiv:2603.24477, 2026.
- [21] M. Sharma, C. B. C. Zhang, C. Bandi, C. Wang, A. Aich, H. Nghiem, T. Rabbani, Y. Htet, B. Jang, S. Basu, A. Balwani, D. Peskoff, M. Ayestaran, S. M. Hendryx, B. Kenstler, and B. Liu. ResearchRubrics: A benchmark of prompts and rubrics for evaluating deep research agents, 2025. URL https://arxiv.org/abs/2511.07685.
- [22] W. Wang, C. Yang, Z. Wang, Y. Huang, Z. Chu, D. Song, L. Zhang, A. R. Chen, and L. Ma. TestEval: Benchmarking large language models for test case generation. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3547–3562, 2025.
- [23] X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024.
- [24] C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying LLM-based software engineering agents. arXiv preprint arXiv:2407.01489, 2024.
- [26] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024.
- [28] D. Zan, Z. Huang, W. Liu, H. Chen, L. Zhang, S. Xin, L. Chen, Q. Liu, X. Zhong, A. Li, et al. Multi-SWE-bench: A multilingual benchmark for issue resolving. arXiv preprint arXiv:2504.02605, 2025.
discussion (0)