Recognition: no theorem link
SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution
Pith reviewed 2026-05-12 02:24 UTC · model grok-4.3
The pith
SWE Atlas supplies a benchmark suite that scores coding agents on engineering quality as well as functional correctness across codebase questions, test writing, and refactoring.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SWE Atlas consists of 284 tasks divided into Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). Evaluation combines programmatic verification with category-specific rubrics that assess software-engineering quality beyond mere correctness. Frontier models such as GPT-5.4 and Opus 4.7 achieve the highest aggregate scores, yet even the best open-weight models score poorly, and all tested models consistently falter on subtle edge cases, complex runtime analysis, and adherence to maintainability practices. The analysis indicates that successful performance depends on extensive codebase exploration and runtime-driven reasoning.
What carries the argument
Under-specified, agentic task formulations paired with a dual evaluation framework that merges programmatic checks and rubric-based scoring of engineering quality attributes.
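The dual framework can be pictured as a two-stage scorer: programmatic checks gate functional correctness, and rubric items grade engineering quality on top of it. A minimal sketch, assuming an illustrative gating rule and aggregation (these names and the formula are assumptions, not the paper's actual protocol); the negative-rubric semantics follow the paper's convention that "YES" means the behavior is present, so a negative rubric passes when the behavior is absent:

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    statement: str   # e.g. "Extracts the local buffer into a shared package"
    negative: bool   # True if the desired outcome is the ABSENCE of the behavior
    present: bool    # judge's verdict: is the behavior present in the agent's diff?

    def passed(self) -> bool:
        # A positive rubric passes when present; a negative one when absent.
        return self.present != self.negative

def score_task(programmatic_checks: list[bool], rubric: list[RubricItem]) -> float:
    """Hypothetical aggregate: all programmatic checks must pass, then
    the score is the fraction of rubric criteria satisfied."""
    if not all(programmatic_checks):
        return 0.0          # fails functional correctness outright
    if not rubric:
        return 1.0
    return sum(item.passed() for item in rubric) / len(rubric)

# Example: functionally correct solution meeting 2 of 3 quality criteria.
rubric = [
    RubricItem("Removes the duplicated helper", negative=False, present=True),
    RubricItem("Introduces new `unsafe` code", negative=True, present=False),
    RubricItem("Reuses the existing test fixture", negative=False, present=False),
]
print(score_task([True, True], rubric))  # → 0.6666666666666666
```

The gate makes quality scores meaningful only for correct solutions; a weighted blend would be an equally plausible design.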
If this is right
- Models must improve their handling of edge cases and runtime reasoning to reach high scores on the new tasks.
- Open-weight models will require substantial gains before they match frontier performance on quality-oriented engineering work.
- Agents that succeed rely on broad codebase exploration rather than narrow local fixes.
- Future agent designs need explicit mechanisms for producing maintainable and reusable code, not only correct code.
- Evaluations limited to issue resolution will underestimate gaps in practical coding capability.
Where Pith is reading between the lines
- Training objectives for coding agents could incorporate rubric-style rewards for abstraction reuse and test coverage to close the observed quality gaps.
- Combining SWE Atlas scores with existing issue-resolution benchmarks would give a fuller picture of where agents still fall short.
- The rubric dimensions might serve as diagnostic tools for prompt engineering or fine-tuning focused on software hygiene.
- If widely adopted, the benchmark could shift industry focus from raw pass rates toward measurable engineering standards in agent outputs.
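The training-objective idea in the first bullet can be made concrete as a scalar reward blending a pass/fail correctness signal with weighted rubric dimensions. A hedged sketch; the dimension names, weights, and mixing fraction are illustrative assumptions, not anything proposed in the paper:

```python
def rubric_reward(tests_pass: bool, rubric_scores: dict[str, float],
                  weights: dict[str, float], quality_mix: float = 0.3) -> float:
    """Blend functional correctness with rubric-graded quality, in [0, 1].

    rubric_scores: per-dimension scores in [0, 1], e.g. from a rubric judge.
    quality_mix:   fraction of the reward allocated to quality dimensions.
    """
    total_w = sum(weights.values())
    quality = sum(weights[k] * rubric_scores.get(k, 0.0) for k in weights) / total_w
    return (1.0 - quality_mix) * float(tests_pass) + quality_mix * quality

# Illustrative dimensions echoing the bullet: abstraction reuse and test coverage.
scores = {"abstraction_reuse": 0.5, "test_coverage": 1.0}
weights = {"abstraction_reuse": 1.0, "test_coverage": 1.0}
print(rubric_reward(True, scores, weights))  # 0.7 * 1.0 + 0.3 * 0.75 = 0.925
```

A reward of this shape never rewards quality more than correctness unless `quality_mix` exceeds 0.5, which keeps the optimizer from gaming rubrics on broken solutions.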
Load-bearing premise
The chosen tasks, their open-ended formulations, and the rubric criteria faithfully represent the range and quality expectations of actual professional software engineering work.
What would settle it
A controlled study in which teams using high-scoring versus low-scoring agents produce measurably different levels of maintainable, reusable code in real multi-week projects, or a survey in which practicing engineers rate the benchmark tasks as unrepresentative of daily workflows.

read the original abstract
We introduce SWE Atlas, a benchmark suite for coding agents spanning three professional software engineering workflows: Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). SWE Atlas differs from prior SWE benchmarks in three key ways: it targets underrepresented but practically important task categories, uses comprehensive category-specific evaluation protocols, and adopts under-specified, agentic task formulations that better reflect real-world usage. Its evaluation framework combines programmatic checks with rubric-based assessment. This goes beyond functional correctness, evaluating software engineering quality, including test and refactor completeness, maintainability, reusable abstractions, and codebase hygiene. We evaluate a range of frontier and open-weight models on SWE Atlas and find that GPT-5.4 and Opus 4.7 achieve the strongest overall performance, while even the best open-weight models score poorly. Our analysis suggests that top models rely on extensive codebase exploration and runtime-driven reasoning. However, even top models consistently struggle with subtle edge cases, complex runtime analysis, and adherence to software engineering best practices. Overall, SWE Atlas provides a complementary evaluation suite for measuring both correctness and engineering quality in coding agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SWE Atlas, a benchmark suite for coding agents with 124 Codebase Q&A tasks, 90 Test Writing tasks, and 70 Refactoring tasks. It differs from prior SWE benchmarks by targeting underrepresented workflows, using under-specified agentic formulations, and combining programmatic checks with category-specific rubrics to assess not only functional correctness but also engineering quality aspects such as test completeness, maintainability, reusable abstractions, and codebase hygiene. Evaluations of frontier and open-weight models show GPT-5.4 and Opus 4.7 achieving the strongest performance, while even top models struggle with edge cases, runtime analysis, and best practices; the work concludes that SWE Atlas offers a complementary evaluation suite.
Significance. If the rubric-based measures of engineering quality prove reliable and distinct from correctness, SWE Atlas would address a genuine gap in existing benchmarks by evaluating practically relevant workflows and quality dimensions, potentially informing more robust coding agent development. The empirical results on model performance and failure modes provide useful data points, though their interpretability hinges on the validity of the evaluation protocols.
major comments (3)
- [§3 and §4] §3 (Benchmark Design) and §4 (Evaluation Framework): The central claim that rubrics measure a distinct 'engineering quality' dimension requires evidence that the criteria (test completeness, maintainability, reusable abstractions, codebase hygiene) are grounded in professional standards. The manuscript describes category-specific rubrics and under-specified tasks but reports no derivation process, no inter-rater reliability statistics, and no correlation with expert judgments or real-world outcomes. This is load-bearing for the complementarity argument, as unvalidated rubrics risk conflating subjective preferences with measurable quality.
- [§5] §5 (Results and Analysis): The claim that top models 'rely on extensive codebase exploration and runtime-driven reasoning' and 'consistently struggle with subtle edge cases' is presented as a key insight, yet the supporting evidence appears to rest on qualitative observations without quantitative breakdowns (e.g., per-task failure rates or ablation on exploration depth). This weakens the analysis of model limitations relative to the benchmark's goals.
- [§3] Task selection (throughout §3): The weakest assumption is that the chosen tasks and under-specified formulations accurately represent professional software engineering workflows. The paper provides no external validation (e.g., expert review of task realism or comparison to industry issue distributions), which is necessary to secure the claim that SWE Atlas better reflects real-world usage than prior benchmarks.
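The reliability evidence the first major comment asks for is cheap to produce once two annotators grade the same rubric items. A minimal stdlib sketch of Cohen's kappa over binary YES/NO rubric verdicts (the sample verdicts are made up for illustration):

```python
def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """Agreement between two raters on binary verdicts, corrected for chance."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a_yes, p_b_yes = sum(a) / n, sum(b) / n
    # Chance agreement under independent raters with these marginal rates.
    expected = p_a_yes * p_b_yes + (1 - p_a_yes) * (1 - p_b_yes)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters constant and identical
    return (observed - expected) / (1 - expected)

rater1 = [True, True, False, True, False, True]
rater2 = [True, True, False, False, False, True]
print(round(cohens_kappa(rater1, rater2), 3))  # → 0.667
```

Reporting per-rubric-dimension kappa (rather than one pooled number) would directly address whether dimensions like maintainability are gradable consistently at all.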
minor comments (2)
- [Abstract and §1] The abstract and introduction would benefit from a concise table summarizing task counts, evaluation metrics, and key differences from prior benchmarks (e.g., SWE-bench) for quick reader orientation.
- [§5] Notation for model names (e.g., 'GPT-5.4', 'Opus 4.7') should be clarified with exact versions or citations to avoid ambiguity in reproducibility.
Simulated Author's Rebuttal
Thank you for your constructive feedback on our paper. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we will revise the manuscript to address the concerns raised.
read point-by-point responses
-
Referee: [§3 and §4] §3 (Benchmark Design) and §4 (Evaluation Framework): The central claim that rubrics measure a distinct 'engineering quality' dimension requires evidence that the criteria (test completeness, maintainability, reusable abstractions, codebase hygiene) are grounded in professional standards. The manuscript describes category-specific rubrics and under-specified tasks but reports no derivation process, no inter-rater reliability statistics, and no correlation with expert judgments or real-world outcomes. This is load-bearing for the complementarity argument, as unvalidated rubrics risk conflating subjective preferences with measurable quality.
Authors: We thank the referee for highlighting this important point. The rubrics were developed by drawing on established software engineering literature and best practices, including principles from 'Clean Code' by Robert Martin and refactoring guidelines from Martin Fowler. However, we agree that explicit documentation of the derivation process and inter-rater reliability would strengthen the claims. In the revised manuscript, we will add a detailed subsection in §3 describing the rubric creation process, including references to professional standards, and report inter-rater agreement statistics based on multiple annotators. We will also discuss potential correlations with real-world outcomes as a direction for future work. This addresses the concern without altering the core evaluation framework. revision: yes
-
Referee: [§5] §5 (Results and Analysis): The claim that top models 'rely on extensive codebase exploration and runtime-driven reasoning' and 'consistently struggle with subtle edge cases' is presented as a key insight, yet the supporting evidence appears to rest on qualitative observations without quantitative breakdowns (e.g., per-task failure rates or ablation on exploration depth). This weakens the analysis of model limitations relative to the benchmark's goals.
Authors: We acknowledge that the current analysis relies partly on qualitative insights from our evaluations. To provide stronger quantitative support, we will include additional analyses in the revised §5, such as per-task failure rate breakdowns by category (e.g., edge cases vs. others) and an ablation study examining the impact of exploration depth on performance. This will make the claims more robust and directly tied to the data. revision: yes
-
Referee: [§3] Task selection (throughout §3): The weakest assumption is that the chosen tasks and under-specified formulations accurately represent professional software engineering workflows. The paper provides no external validation (e.g., expert review of task realism or comparison to industry issue distributions), which is necessary to secure the claim that SWE Atlas better reflects real-world usage than prior benchmarks.
Authors: The task selection was informed by an analysis of common workflows in open-source repositories and professional software engineering practices, focusing on areas underrepresented in existing benchmarks like SWE-bench. While we did not perform a formal external expert review in this version, we will revise §3 to include more details on the selection criteria, such as statistics from GitHub issues and pull requests that motivated the task categories. We will also add a limitations section acknowledging the need for broader validation in future iterations. This maintains the paper's position while addressing the validation concern. revision: partial
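The quantitative breakdown promised in the second response amounts to grouping per-task outcomes by category and failure mode. A stdlib-only sketch over hypothetical evaluation records (the field names and category labels are assumptions, not the paper's actual schema):

```python
from collections import defaultdict

# Hypothetical per-task records: (category, failure_mode), None = task passed.
records = [
    ("test_writing", "edge_case"),
    ("test_writing", None),
    ("refactoring", "maintainability"),
    ("codebase_qa", "runtime_analysis"),
    ("codebase_qa", None),
    ("refactoring", "edge_case"),
]

def failure_rates(records):
    """Per-category rate of each failure mode among all tasks in the category."""
    totals = defaultdict(int)
    fails = defaultdict(lambda: defaultdict(int))
    for category, mode in records:
        totals[category] += 1
        if mode is not None:
            fails[category][mode] += 1
    return {
        cat: {mode: count / totals[cat] for mode, count in modes.items()}
        for cat, modes in fails.items()
    }

for cat, modes in sorted(failure_rates(records).items()):
    print(cat, modes)
```

Tables of this shape, one per model, would turn the qualitative claims about edge cases and runtime analysis into checkable numbers.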
Circularity Check
No circularity: empirical benchmark with no derivations or fitted predictions
full rationale
The paper introduces SWE Atlas as a new benchmark suite across three task categories with category-specific evaluation protocols that combine programmatic checks and rubric-based assessment. No mathematical derivations, equations, fitted parameters, or model predictions appear in the provided text. Claims about model performance (e.g., GPT-5.4 and Opus 4.7 achieving strongest results) are direct empirical observations from running models on the introduced tasks, not reductions of outputs to self-defined inputs by construction. Rubric criteria for engineering quality are presented as part of the benchmark definition rather than derived via any chain that loops back to prior results within the paper. This is a standard empirical benchmark introduction whose central contribution is the task set and evaluation framework itself, with no load-bearing self-citation or ansatz that collapses the argument.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] A. F. Akyürek, A. Gosai, C. B. C. Zhang, V. Gupta, J. Jeong, A. Gunjal, T. Rabbani, M. Mazzone, D. Randolph, M. Mahmoudi Meymand, G. Chattha, P. Rodriguez, D. Mares, P. Singh, M. Liu, S. Chawla, P. Cline, L. Ogaz, E. Hernandez, Z. Wang, P. Bhatter, M. Ayestaran, B. Liu, and Y. He. PRBench: Large-scale expert rubrics for evaluating high-stakes profess...
- [2] Anthropic. Introducing Claude Opus 4.7, Apr. 2026. URL https://www.anthropic.com/news/claude-opus-4-7. Accessed: 2026-05-04.
- [3] R. K. Arora, J. Wei, R. Soskin Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsimpourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, J. Heidecke, and K. Singhal. HealthBench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025.
- [4] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- [5] J. Baumann, V. Padmakumar, X. Li, J. Yang, D. Yang, and S. Koyejo. SWE-chat: Coding agent interactions from real users in the wild, 2026. URL https://arxiv.org/abs/2604.20779.
- [6] F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M.-H. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman, et al. MultiPL-E: A scalable and extensible approach to benchmarking neural code generation. arXiv preprint arXiv:2208.08227, 2022.
- [8] M. Chen. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- [9] X. Deng, J. Da, E. Pan, Y. Y. He, C. Ide, K. Garg, N. Lauffer, A. Park, N. Pasari, C. Rane, et al. SWE-bench Pro: Can AI agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941, 2025.
- [11] D. Gautam, S. Garg, J. Jang, N. Sundaresan, and R. Z. Moghaddam. RefactorBench: Evaluating stateful reasoning in language agents through code. arXiv preprint arXiv:2503.07832, 2025.
- [12] Harbor Framework Team. Harbor: A framework for evaluating and optimizing agents and models in container environments, Jan. 2026. URL https://github.com/harbor-framework/harbor.
- [14] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
- [15] M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026. URL https://arxiv.org/abs/2601.11868.
- [16] N. Mündler, M. Müller, J. He, and M. Vechev. SWT-bench: Testing and validating real-world bug-fixes with code agents. Advances in Neural Information Processing Systems, 37:81857–81887, 2024.
- [17] OpenAI. Why we no longer evaluate SWE-Bench Verified, 2026. URL https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/. Accessed: 2026-05-04.
- [19] M. Raghavendra, A. Gunjal, B. Liu, and Y. He. Agentic rubrics as contextual verifiers for SWE agents, 2026. URL https://arxiv.org/abs/2601.04171.
- [20] C. Research, A. Chan, A. Shalaby, A. Wettig, A. Sanger, A. Zhai, A. Ajay, A. Nair, C. Snell, C. Lu, C. Shen, E. Jia, F. Cassano, H. Liu, H. Chen, H. Wildermuth, J. Jackson, J. Li, J. Katz, J. Yao, J. Hejna, J. Warner, J. Vering, K. Frans, L. Danilek, L. Wright, L. Cen, L. Melas-Kyriazi, M. Truell, M. de Jong, N. Jain, N. Schmidt, N. Wang, N. Muennighof... Composer 2 technical report. arXiv preprint arXiv:2603.24477, 2026.
- [21] M. Sharma, C. B. C. Zhang, C. Bandi, C. Wang, A. Aich, H. Nghiem, T. Rabbani, Y. Htet, B. Jang, S. Basu, A. Balwani, D. Peskoff, M. Ayestaran, S. M. Hendryx, B. Kenstler, and B. Liu. ResearchRubrics: A benchmark of prompts and rubrics for evaluating deep research agents, 2025. URL https://arxiv.org/abs/2511.07685.
- [22] W. Wang, C. Yang, Z. Wang, Y. Huang, Z. Chu, D. Song, L. Zhang, A. R. Chen, and L. Ma. TestEval: Benchmarking large language models for test case generation. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3547–3562, 2025.
- [23] X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024.
- [24] C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying LLM-based software engineering agents. arXiv preprint arXiv:2407.01489, 2024.
- [26] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024.
- [28] D. Zan, Z. Huang, W. Liu, H. Chen, L. Zhang, S. Xin, L. Chen, Q. Liu, X. Zhong, A. Li, et al. Multi-SWE-bench: A multilingual benchmark for issue resolving. arXiv preprint arXiv:2504.02605, 2025.
discussion (0)