Projecting the Emerging Mindset of SWE Agent by Launching a Wild Code Understanding Journey

Yan Liu; Zhengyi Zhuo

arxiv: 2606.08500 · v1 · pith:4TMSEGNKnew · submitted 2026-06-07 · 💻 cs.SE · cs.AI

Pith reviewed 2026-06-27 18:16 UTC · model grok-4.3

keywords: SWE agents, code understanding, agent trajectories, observation lenses, repository exploration, behavioral profiles, tool-mediated agents, epistemic grounding

The pith

SWE-agent trajectories become comparable behavioral profiles when read through five observation lenses on navigation, evidence, synthesis, grounding, and stopping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Ada, an apparatus that explores real code repositories through a bounded set of tools while recording every think-action step as a finite trajectory. Five observation lenses turn those raw traces into visible records of how the agent chooses where to look, what evidence to trust, when to consolidate understanding, and when to stop. Across 408 trajectories collected from multiple models, repositories, and task conditions, the lenses produce disciplined profiles of agent behavior without collapsing to tool counts or guessing at unrecorded intent. The resulting profiles reveal measurable differences in efficiency, diversity of paths, and epistemic grounding. The work supplies a repeatable method for turning faithful digital traces into projections of an emerging SWE-agent mindset.

Core claim

Ada enters real codebases through a bounded tool interface, allowing open-ended exploration to remain recordable as finite trajectories. We project Ada's think-action chains through observation lenses that make navigation, evidence selection, synthesis, grounding, and stopping visible without reducing behavior to raw tool counts or speculating about hidden intent. Read together, these lenses produce behavioral profiles grounded in recorded movement through software worlds. Across 408 trajectories, spanning multiple models, repositories, task families, and launch conditions, the study shows how faithful digital traces can be transformed into disciplined, comparable projections of emerging SWE-agent mindset.

What carries the argument

Ada, a scoped apparatus that explores repositories via a bounded tool interface and projects think-action trajectories through five observation lenses (navigation, evidence selection, synthesis, grounding, stopping) to generate comparable behavioral profiles.
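To make "bounded tool interface" and "finite trajectory" concrete, here is a minimal sketch. It is not Ada's implementation: the `Step` schema, the `BoundedSession` class, the tool names, and the step budget are all assumptions chosen for illustration, since the abstract does not describe the actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    """One recorded think-action step (hypothetical schema)."""
    thought: str
    tool: str
    argument: str
    observation: str


@dataclass
class BoundedSession:
    """Exposes a fixed tool set and a step budget, so every run is finite."""
    tools: Dict[str, Callable[[str], str]]
    max_steps: int = 50
    trajectory: List[Step] = field(default_factory=list)

    def act(self, thought: str, tool: str, argument: str) -> str:
        # Refuse to go past the budget: the trajectory stays finite.
        if len(self.trajectory) >= self.max_steps:
            raise RuntimeError("step budget exhausted")
        # Refuse tools outside the interface: exploration stays bounded.
        if tool not in self.tools:
            raise KeyError(f"tool not in bounded interface: {tool}")
        observation = self.tools[tool](argument)
        self.trajectory.append(Step(thought, tool, argument, observation))
        return observation


# Toy in-memory "repository" standing in for a real codebase (assumption).
REPO = {"README.md": "demo project", "src/main.py": "print('hi')"}

session = BoundedSession(
    tools={
        "list_files": lambda _: "\n".join(sorted(REPO)),
        "read_file": lambda path: REPO.get(path, "<missing>"),
    },
    max_steps=3,
)
session.act("See what exists", "list_files", "")
session.act("Read the entry point", "read_file", "src/main.py")
```

Because every accepted call appends exactly one `Step`, the recorded trajectory is a faithful, replayable trace of what the agent did, which is the property the lenses are said to depend on.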

If this is right

Efficiency differences across models become measurable through the same lens-derived profiles rather than post-hoc inspection.
Trajectory diversity and the degree of epistemic grounding can be compared directly across launch conditions.
Limits on how much external intervention can alter agent stopping behavior become observable in the profiles.
The method supplies a repeatable foundation for studying SWE-agent behavior inside actual codebases instead of toy environments.
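One way to picture a comparable profile is as the share of a trajectory's steps that fall under each lens, with a distance between two such profiles. The sketch below is an assumption for illustration only: the abstract does not say how profiles are computed, the label names are hypothetical, and the paper's profiles are presumably richer than a five-number summary.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical label names for the five lenses named in the abstract.
LENSES = ["navigation", "evidence_selection", "synthesis", "grounding", "stopping"]


def lens_profile(labels: List[str]) -> Dict[str, float]:
    """Share of a trajectory's steps under each lens (zero for unused lenses)."""
    if not labels:
        raise ValueError("empty trajectory")
    unknown = set(labels) - set(LENSES)
    if unknown:
        raise ValueError(f"unknown lens labels: {sorted(unknown)}")
    counts = Counter(labels)
    return {lens: counts[lens] / len(labels) for lens in LENSES}


def profile_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two profiles, in [0, 1]."""
    return 0.5 * sum(abs(p[lens] - q[lens]) for lens in LENSES)


# Two made-up labeled trajectories, one label per step.
run_a = ["navigation", "navigation", "evidence_selection", "stopping"]
run_b = ["navigation", "evidence_selection", "synthesis", "grounding", "stopping"]

print(lens_profile(run_a))
print(profile_distance(lens_profile(run_a), lens_profile(run_b)))
```

Normalizing by trajectory length keeps long and short runs comparable, which raw counts would not.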

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The lens approach could scale to automated scoring of thousands of agent runs if the five categories are encoded as classifiers.
Profiles might later be used to diagnose why one agent succeeds on a task while another fails by tracing differences in grounding steps.
The bounded-interface design suggests a template for other domains where agents must explore large state spaces without unbounded tool access.

Load-bearing premise

The five observation lenses can render agent behavior visible and comparable without collapsing it into raw tool counts or requiring guesses about hidden intent.

What would settle it

Apply the same five lenses to a fresh set of trajectories from a different model or repository and check whether the resulting profiles lose all ability to distinguish models or task conditions while still matching independent performance metrics.

Original abstract

Software engineering agents (SWE agents) increasingly work through tool-mediated trajectories in real repositories, yet their behavior remains difficult to characterize in concrete, observable terms. These trajectories record tool use, intermediate reasoning, evidence selection, and self-directed stopping, but they do not by themselves explain why particular moves were chosen, what evidence was trusted, or when understanding was judged sufficient. This tension makes trajectory data both limited and valuable: faithful, replayable traces can become an empirical substrate for studying agent behavior when interpreted through disciplined observation. We introduce Ada, a scoped apparatus for repository-level code understanding. Ada enters real codebases through a bounded tool interface, allowing open-ended exploration to remain recordable as finite trajectories. Across this wild-but-bounded setting, Ada chooses where to look, what to read closely, when to consolidate partial understanding, and when to close its account of the repository. We project Ada's think-action chains through observation lenses that make navigation, evidence selection, synthesis, grounding, and stopping visible without reducing behavior to raw tool counts or speculating about hidden intent. Read together, these lenses produce behavioral profiles grounded in recorded movement through software worlds. Across 408 trajectories, spanning multiple models, repositories, task families, and launch conditions, the study shows how faithful digital traces can be transformed into disciplined, comparable projections of emerging SWE-agent mindset. The results expose differences in efficiency, trajectory diversity, epistemic grounding, and the limits of intervention, while providing a methodological foundation for observing SWE agent behavior in real codebases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Ada gives a bounded tool setup plus five lenses to turn SWE-agent trajectories into behavioral profiles, but the abstract leaves the lens definitions, trajectory collection, and validation undescribed, so the claims cannot be checked.

Two things matter here. First, the paper presents Ada as a scoped apparatus that enters real codebases via a bounded tool interface and then projects think-action chains through five lenses (navigation, evidence selection, synthesis, grounding, stopping) to create comparable mindset profiles. Second, it reports doing this across 408 trajectories from multiple models and repos. The bounded interface plus the specific lens set is the concrete new piece; prior agent logging work exists, but this combination, aimed at repository-level code understanding without raw counts or intent guesses, appears fresh.

The paper does a reasonable job framing the core tension: trajectories are faithful traces yet do not explain choices or sufficiency judgments, so disciplined observation is needed. Framing the problem this way and proposing to treat the traces as an empirical substrate is useful for the SWE-agent interpretability discussion.

The soft spots sit in the methods. The abstract states that the lenses produce grounded, comparable profiles and expose differences in efficiency and epistemic grounding, yet gives no operational definitions for the lenses, no account of how the 408 trajectories were gathered or filtered, and no validation steps. This directly raises the stress-test point: mapping recorded actions to categories like synthesis versus evidence selection or deciding when stopping is grounded will almost certainly involve interpretive choices unless the mappings are fully mechanical. Without those details the profiles risk being sensitive to coder judgment, which undercuts the claim of disciplined comparability across models and repositories. The circularity concern also stands until the full text shows whether the projections are derived independently of the features used to define them.

This is for researchers working on SWE-agent evaluation, logging, or behavioral analysis who want concrete ways to observe tool-using agents inside actual codebases. A reader already thinking about trajectory interpretability would find the framing and lens names worth seeing. It deserves a serious referee because the underlying problem is live in the field and the proposed apparatus is a direct attempt to address it; the referee process can require the missing operational details and data description. I would send it to review rather than desk reject.

Referee Report

3 major / 1 minor

Summary. The paper introduces Ada, an apparatus for repository-level code understanding that enters real codebases via a bounded tool interface to generate finite, recordable trajectories. It applies five observation lenses (navigation, evidence selection, synthesis, grounding, stopping) to 408 trajectories spanning multiple models, repositories, task families, and launch conditions, claiming these lenses transform the trajectories into disciplined, comparable projections of emerging SWE-agent mindset without reducing behavior to raw tool counts or speculating on hidden intent. The results are said to expose differences in efficiency, trajectory diversity, epistemic grounding, and intervention limits while providing a methodological foundation.

Significance. If the lenses can be shown to be mechanically applicable and independent of coder judgment, the approach could supply a useful empirical substrate for characterizing tool-mediated agent behavior in software repositories, moving beyond raw logs toward observable behavioral profiles. The scale of 408 trajectories across varied conditions is a potential strength for comparability claims.

major comments (3)

[section on observation lenses] The section introducing the five observation lenses provides no explicit operational definitions, decision procedures, or mechanical mapping rules (e.g., regex on think-action logs or decision trees) for classifying segments as navigation vs. evidence selection vs. synthesis vs. grounding vs. stopping. This is load-bearing for the central claim that the projections are 'disciplined' and avoid speculation on hidden intent; without such rules the profiles remain sensitive to interpretive choices and cross-trajectory comparability cannot be verified.
[section describing the 408 trajectories] The section describing the 408 trajectories does not specify collection protocol, selection criteria, exact models/repositories/task families/launch conditions, or any filtering steps. This directly undermines the claim that the study spans multiple conditions and yields generalizable, comparable mindset projections grounded in recorded movement.
[validation or results section] No validation steps (inter-rater reliability, comparison of lens outputs to raw trajectory data, or sensitivity analysis) are reported for the application of the lenses. This is load-bearing because the weakest assumption—that the lenses make behavior visible without reducing to tool counts or requiring speculation—cannot be assessed without evidence that the mappings are reproducible.
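The "mechanical mapping rules" requested in the first major comment could look something like the following sketch. Every rule here is hypothetical: the tool names, the regular expressions, and the first-match-wins ordering are assumptions for illustration, not the paper's lens definitions.

```python
import re
from typing import Optional

# Hypothetical rules: (lens, field to inspect, pattern). First match wins.
RULES = [
    ("stopping", "tool", re.compile(r"^(finish|submit)$")),
    ("navigation", "tool", re.compile(r"^(list_files|search|find)$")),
    ("evidence_selection", "tool", re.compile(r"^read_file$")),
    ("grounding", "thought", re.compile(r"\b(line \d+|as shown in|according to)\b", re.I)),
    ("synthesis", "thought", re.compile(r"\b(so far|overall|in summary)\b", re.I)),
]


def classify(step: dict) -> Optional[str]:
    """Map one logged step to a lens using only observable log fields."""
    for lens, field_name, pattern in RULES:
        if pattern.search(step.get(field_name, "")):
            return lens
    # No rule fired: report that instead of guessing at intent.
    return None


steps = [
    {"tool": "list_files", "thought": "Look around"},
    {"tool": "read_file", "thought": "Open main"},
    {"tool": "note", "thought": "Overall, the CLI wraps the parser"},
    {"tool": "note", "thought": "According to line 12, main() calls run()"},
    {"tool": "finish", "thought": "Done"},
    {"tool": "note", "thought": "Hmm"},
]
labels = [classify(s) for s in steps]
print(labels)
```

Returning `None` for unmatched steps keeps the interpretive gap visible: the fraction of unclassified steps is itself a measure of how much the mapping still relies on judgment.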

minor comments (1)

[abstract] The abstract is overly dense; separating the apparatus description, lens definitions, and empirical claims into distinct sentences would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and for identifying specific gaps that affect the reproducibility and verifiability of our central claims. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [section on observation lenses] The section introducing the five observation lenses provides no explicit operational definitions, decision procedures, or mechanical mapping rules (e.g., regex on think-action logs or decision trees) for classifying segments as navigation vs. evidence selection vs. synthesis vs. grounding vs. stopping. This is load-bearing for the central claim that the projections are 'disciplined' and avoid speculation on hidden intent; without such rules the profiles remain sensitive to interpretive choices and cross-trajectory comparability cannot be verified.

Authors: We agree that the absence of explicit mechanical mapping rules weakens the claim that the lenses produce disciplined, comparable projections. In the revised manuscript we will insert a new subsection that supplies operational definitions and decision procedures for each lens, including pattern-matching rules on think-action logs and decision trees that map observable log features to the five categories without reference to inferred intent. revision: yes
Referee: [section describing the 408 trajectories] The section describing the 408 trajectories does not specify collection protocol, selection criteria, exact models/repositories/task families/launch conditions, or any filtering steps. This directly undermines the claim that the study spans multiple conditions and yields generalizable, comparable mindset projections grounded in recorded movement.

Authors: We accept that the current description is insufficient for reproducibility. The revised manuscript will expand the trajectories section with a dedicated protocol subsection that enumerates the exact models, repositories, task families, launch conditions, collection procedure, selection criteria, and any filtering applied to arrive at the 408 trajectories. revision: yes
Referee: [validation or results section] No validation steps (inter-rater reliability, comparison of lens outputs to raw trajectory data, or sensitivity analysis) are reported for the application of the lenses. This is load-bearing because the weakest assumption—that the lenses make behavior visible without reducing to tool counts or requiring speculation—cannot be assessed without evidence that the mappings are reproducible.

Authors: We acknowledge the lack of reported validation. We will add a validation subsection that reports inter-rater reliability on a sampled subset of trajectories, quantitative comparison of lens outputs against raw logs, and sensitivity analyses under varied mapping thresholds to demonstrate that the lens applications are reproducible. revision: yes
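Inter-rater reliability on lens labels is commonly summarized with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The rebuttal does not say which statistic the authors would use, so the choice of kappa and the example labels below are assumptions; this is a sketch of the standard two-rater formula.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up labels from two raters for the same six trajectory steps.
rater_1 = ["navigation", "navigation", "evidence", "synthesis", "stopping", "grounding"]
rater_2 = ["navigation", "evidence", "evidence", "synthesis", "stopping", "grounding"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 3))
```

A kappa near 1 would support the claim that the lens mappings are reproducible; a low value would indicate that the categories still depend on coder judgment.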

Circularity Check

0 steps flagged

No circularity; methodological description is self-contained

The paper presents a qualitative apparatus (Ada) and five observation lenses applied to recorded trajectories. No equations, fitted parameters, or derivations appear in the abstract or described structure. The central claim—that lenses produce comparable projections from think-action chains without reducing to tool counts or speculating on intent—is presented as a definitional methodological choice rather than a reduction to prior inputs or self-citations. No load-bearing step reduces by construction to its own outputs, and the work contains no self-citation chains or uniqueness theorems. This is the expected non-finding for a descriptive empirical study without quantitative modeling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the introduction of the Ada apparatus itself.

invented entities (1)

Ada (no independent evidence)
purpose: scoped apparatus for repository-level code understanding via bounded tool interface and observation lenses
New system introduced in the abstract to enable recordable open-ended exploration.

pith-pipeline@v0.9.1-grok · 5804 in / 1169 out tokens · 19064 ms · 2026-06-27T18:16:50.758409+00:00 · methodology


2014

[7] [7]

The Quarterly Journal of Economics69(1), 99–118 (1955) https://arxiv.org/abs/1884852

Simon, H.A.: A behavioral model of rational choice. The Quarterly Journal of Economics69(1), 99–118 (1955) https://arxiv.org/abs/1884852. https://doi.org/ 10.2307/1884852

work page doi:10.2307/1884852 1955

[8] [8]

In: Thirty-Seventh Con- ference on Neural Information Processing Systems (2023)

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K.R., Yao, S.: Reflexion: Language agents with verbal reinforcement learning. In: Thirty-Seventh Con- ference on Neural Information Processing Systems (2023)

2023

[9] [9]

Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot

Bourtoule, L., Chandrasekaran, V., Choquette-Choo, C.A., Jia, H., Travers, A., Zhang, B., Lie, D., Papernot, N.: Machine unlearning. In: 2021 IEEE Symposium on Security and Privacy (SP), pp. 141–159. IEEE, San Francisco, CA, USA (2021). https://doi.org/10.1109/SP40001.2021.00019 Springer Nature 2021 LATEX template 56Projecting the Emerging Mindset of SWE Agent

work page doi:10.1109/sp40001.2021.00019 2021

[10] [10]

In: Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2025)

Bouzenia, I., Pradel, M.: Understanding software engineering agents: A study of thought-action-result trajectories. In: Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2025). arXiv:2506.18824

arXiv 2025

[11] [11]

In: The Eleventh International Conference on Learning Representations (2022)

Wang, X., Wei, J., Schuurmans, D., Le, Q.V., Chi, E.H., Narang, S., Chowd- hery, A., Zhou, D.: Self-consistency improves chain-of-thought reasoning in language models. In: The Eleventh International Conference on Learning Representations (2022)

2022

[12] [12]

In: The Thirteenth International Conference on Learning Representations (2024)

Zhang, K., Yao, W., Liu, Z., Feng, Y., Liu, Z., N, R.R., Lan, T., Li, L., Lou, R., Xu, J., Pang, B., Zhou, Y., Heinecke, S., Savarese, S., Wang, H., Xiong, C.: Diversity empowers intelligence: Integrating expertise of software engineering agents. In: The Thirteenth International Conference on Learning Representations (2024)

2024

[13] [13]

In: Workshop on Reasoning and Planning for Large Language Models (2025)

Arcuschin, I., Janiak, J., Krzyzanowski, R., Rajamanoharan, S., Nanda, N., Conmy, A.: Chain-of-thought reasoning in the wild is not always faithful. In: Workshop on Reasoning and Planning for Large Language Models (2025)

2025

[14] [14]

When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling

Zhou, S., Ling, R., Chen, J., Wang, X., Fan, T., Wang, H.: When more thinking hurts: Overthinking in LLM test-time compute scaling. arXiv (2026). https:// doi.org/10.48550/arXiv.2604.10739

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.10739 2026

[15] [15]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

Jimenez, C., Lieret, K., Narasimhan, K., Press, O., Wettig, A., Yang, J., Yao, S.: SWE-agent: Agent-computer interfaces enable automated software engineer- ing. In: Advances in Neural Information Processing Systems 37, pp. 50528– 50652. Neural Information Processing Systems Foundation, Inc. (NeurIPS), Vancouver, BC, Canada (2024). https://doi.org/10.52202...

work page doi:10.52202/079017-1601 2024

[16] [16]

In: Workshop on Scaling Environ- ments for Agents (2025)

Gandhi, S., Tsay, J., Ganhotra, J., Kate, K., Rizk, Y.: When agents go astray: Course-correcting SWE agents with PRMs. In: Workshop on Scaling Environ- ments for Agents (2025)

2025

[17] [17]

arXiv:2509.09853 (2025)

Fan, Z., Vasilevski, K., Lin, D., Chen, B., Chen, Y., Zhong, Z., Zhang, J.M., He, P., Hassan, A.E.: SWE-Effi: Re-evaluating software AI agent system effectiveness under resource constraints. arXiv:2509.09853 (2025)

arXiv 2025

[18] [18]

In: ICLR 2026 Workshop on Lifelong Agents: Learning, Aligning, Evolving (2026)

Li, H., Mang, Q., He, R., Zhang, Q., Mao, H., Chen, X., Zhou, H., Cheung, A., Gonzalez, J.E., Stoica, I.: Continuum: Efficient and robust multi-turn LLM agent scheduling with KV cache time-to-live. In: ICLR 2026 Workshop on Lifelong Agents: Learning, Aligning, Evolving (2026)

2026

[19] [19]

In: Proceedings of the 16th Workshop on Hot Topics in Operating Systems, pp

Huang, P., Guo, C., Zhou, L., Lorch, J.R., Dang, Y., Chintalapati, M., Yao, R.: Gray failure: The Achilles’ heel of cloud-scale systems. In: Proceedings of the 16th Workshop on Hot Topics in Operating Systems, pp. 150–155. ACM, Whistler BC Canada (2017). https://doi.org/10.1145/3102980.3103005 Springer Nature 2021 LATEX template Projecting the Emerging Mi...

work page doi:10.1145/3102980.3103005 2017

[20] [20]

https://arxiv.org/abs/2406.10162

Denison, C., MacDiarmid, M., Barez, F., Duvenaud, D., Kravec, S., Marks, S., Schiefer, N., Soklaski, R., Tamkin, A., Kaplan, J., Shlegeris, B., Bowman, S.R., Perez, E., Hubinger, E.: Sycophancy to subterfuge: Investigating reward- tampering in large language models (2024). https://arxiv.org/abs/2406.10162

Pith/arXiv arXiv 2024

[21] [21]

arXiv:2603.09654 (2026)

Augenstein, I.: Understanding the interplay between LLMs’ utilisation of para- metric and contextual knowledge: A keynote at ECIR 2025. arXiv:2603.09654 (2026)

arXiv 2025

[22] [22]

https://arxiv.org/abs/2602.01011

Pappu, A., El, B., Cao, H., di Nolfo, C., Sun, Y., Cao, M., Zou, J.: Multi-agent teams hold experts back (2026). https://arxiv.org/abs/2602.01011

Pith/arXiv arXiv 2026

[23] [23]

https://doi.org/10.13140/RG.2.2.14475.96802

Sartori, C.C.: The specification gap: Coordination failure under partial knowl- edge in code agents (2026). https://doi.org/10.13140/RG.2.2.14475.96802

work page doi:10.13140/rg.2.2.14475.96802 2026

[24] [24]

Applied Sciences16(10), 4914 (2026)

Maryanskyy, A., Budnikov, D., Kaliyev, A.T.: When agents disagree: The selec- tion bottleneck in multi-agent LLM pipelines. Applied Sciences16(10), 4914 (2026). https://doi.org/10.3390/app16104914

work page doi:10.3390/app16104914 2026

[25] [25]

2025 40th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW), 315–322 (2025)

Barrak, A.: Traceability and accountability in role-specialized multi-agent LLM pipelines. 2025 40th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW), 315–322 (2025). https://doi.org/10. 1109/ASEW67777.2025.00064

arXiv 2025

[26] [26]

arXiv:2510.02837 (2025)

Kim, W., Park, S., In, Y., Kim, S., Lee, D., Park, C.: Beyond the final answer: Eval- uating the reasoning trajectories of tool-augmented agents. arXiv:2510.02837 (2025)

Pith/arXiv arXiv 2025

[27] [27]

In: The Thirteenth International Conference on Learning Representations (2024)

Gautam, D., Garg, S., Jang, J., Sundaresan, N., Moghaddam, R.Z.: Refactor- Bench: Evaluating stateful reasoning in language agents through code. In: The Thirteenth International Conference on Learning Representations (2024)

2024

[28] [28]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2025)

Yang, J., Lieret, K., Jimenez, C.E., Wettig, A., Khandpur, K., Zhang, Y., Hui, B., Press, O., Schmidt, L., Yang, D.: SWE-smith: Scaling data for software engineer- ing agents. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2025)

2025

[29] [29]

https://arxiv.org/abs/2508.18993

Ni, Z., Wang, H., Zhang, S., Lu, S., He, Z., You, W., Tang, Z., Du, Y., Sun, B., Liu, H., Hu, S., Chen, R., Li, B., Li, X., Hu, C., Jiao, B., Jiang, D., Lyu, P.: GitTaskBench: A benchmark for code agents solving real-world tasks through code repository leveraging (2025). https://arxiv.org/abs/2508.18993

arXiv 2025

[30] [30]

arXiv:2504.08703 (2025) Springer Nature 2021 LATEX template 58Projecting the Emerging Mindset of SWE Agent

Rashid, M.S., Bock, C., Zhuang, Y., Buchholz, A., Esler, T., Valentin, S., Franceschi, L., Wistuba, M., Sivaprasad, P.T., Kim, W.J., Deoras, A., Zappella, G., Callot, L.: SWE-PolyBench: A multi-language benchmark for repository-level evaluation of coding agents. arXiv:2504.08703 (2025) Springer Nature 2021 LATEX template 58Projecting the Emerging Mindset ...

arXiv 2025

[31] [31]

In: Chiruzzo, L., Ritter, A., Wang, L

Lu, J., Holleis, T., Zhang, Y., Aumayer, B., Nan, F., Bai, H., Ma, S., Ma, S., Li, M., Yin, G., Wang, Z., Pang, R.: ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. In: Chiruzzo, L., Ritter, A., Wang, L. (eds.) Findings of the Association for Computational Linguistics: NAACL 2025, pp. 1160–1183. Asso...

work page doi:10.18653/v1/2025.findings-naacl 2025

[32] [32]

In: The Thirteenth Inter- national Conference on Learning Representations (2024)

Yao, S., Shinn, N., Razavi, P., Narasimhan, K.R.:τ-bench: A benchmark for Tool-Agent-User interaction in real-world domains. In: The Thirteenth Inter- national Conference on Learning Representations (2024)

2024

[33] [33]

In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T

Wang, H., Huang, W., Wang, Y., Xi, Y., Lu, J., Zhang, H., Hu, N., Liu, Z., Pan, J.Z., Wong, K.-F.: Rethinking stateful tool use in multi-turn dialogues: Bench- marks and challenges. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguistics: ACL 2025, pp. 5433–5453. Association for Computational ...

work page doi:10.18653/v1/2025.findings-acl.284 2025

[34] [34]

In: The Twelfth International Conference on Learn- ing Representations (2023)

Liu, T., Xu, C., McAuley, J.: RepoBench: Benchmarking repository-level code auto-completion systems. In: The Twelfth International Conference on Learn- ing Representations (2023)

2023

[35] [35]

In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S

Ding, Y., Wang, Z., Ahmad, W., Ding, H., Tan, M., Jain, N., Ramanathan, M.K., Nallapati, R., Bhatia, P., Roth, D., Xiang, B.: CrossCodeEval: A diverse and mul- tilingual benchmark for cross-file code completion. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Infor- mation Processing Systems, vol. 36, pp...

2023

[36] [36]

Proceedings of the ACM on Software Engineering1(FSE), 675–698 (2024)

Bairi, R., Sonwane, A., Kanade, A., C., V.D., Iyer, A., Parthasarathy, S., Raja- mani, S., Ashok, B., Shet, S.: CodePlan: Repository-level coding using LLMs and planning. Proceedings of the ACM on Software Engineering1(FSE), 675–698 (2024). https://doi.org/10.1145/3643757

work page doi:10.1145/3643757 2024

[37] [37]

LMMs-eval: Reality check on the evaluation of large multimodal models

Du, J., Liu, Y., Guo, H., Wang, J., Huang, H., Ni, Y., Li, Z.: DependEval: Benchmarking LLMs for repository dependency understanding. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguistics: ACL 2025, pp. 7150–7179. Association for Compu- tational Linguistics, Vienna, Austria (2025). https://d...

work page doi:10.18653/v1/2025 2025

[38] [38]

https://arxiv.org/abs/ 2509.14635

Peng, W., Shi, Y., Wang, Y., Zhang, X., Shen, B., Gu, X.: SWE-QA: Can language models answer repository-level code questions? (2026). https://arxiv.org/abs/ 2509.14635

Pith/arXiv arXiv 2026

[39] [39]

In: Proceedings of the 3rd ACM SIGSOFT Symposium on Foundations of Software Engineering, pp

Murphy, G.C., Notkin, D., Sullivan, K.: Software reflexion models: Bridging the Springer Nature 2021 LATEX template Projecting the Emerging Mindset of SWE Agent59 gap between source and high-level models. In: Proceedings of the 3rd ACM SIGSOFT Symposium on Foundations of Software Engineering, pp. 18–28. ACM, Washington D.C. USA (1995). https://doi.org/10....

work page doi:10.1145/222124.222136 2021

[40] [40]

IEEE Transactions on Software Engineering35(4), 573–591 (2009)

Ducasse, S., Pollet, D.: Software architecture reconstruction: A process- oriented taxonomy. IEEE Transactions on Software Engineering35(4), 573–591 (2009). https://doi.org/10.1109/TSE.2009.19

work page doi:10.1109/tse.2009.19 2009

[41] [41]

Technical report, Defense Technical Information Center, Fort Belvoir, V A (August 2000)

Kazman, R., Klein, M., Clements, P.: ATAM: Method for architecture evalua- tion. Technical report, Defense Technical Information Center, Fort Belvoir, V A (August 2000). https://doi.org/10.21236/ADA382629

work page doi:10.21236/ada382629 2000

[42] [42]

IEEE Software30(2), 38–45 (2013)

Chen, L., Ali Babar, M., Nuseibeh, B.: Characterizing architecturally significant requirements. IEEE Software30(2), 38–45 (2013). https://doi.org/10.1109/MS. 2012.174

work page doi:10.1109/ms 2013

[43] [43]

IEEE Transactions on Software Engineering29(3), 210–224 (2003)

Eisenbarth, T., Koschke, R., Simon, D.: Locating features in source code. IEEE Transactions on Software Engineering29(3), 210–224 (2003). https://doi.org/ 10.1109/TSE.2003.1183929

work page doi:10.1109/tse.2003.1183929 2003

[44] [44]

IBM Systems Journal15(3), 182–211 (1976)

Fagan, M.E.: Design and code inspections to reduce errors in program devel- opment. IBM Systems Journal15(3), 182–211 (1976). https://doi.org/10.1147/sj. 153.0182

work page doi:10.1147/sj 1976

[45] [45]

In: 2013 35th International Conference on Software Engineering (ICSE), pp

Bacchelli, A., Bird, C.: Expectations, outcomes, and challenges of modern code review. In: 2013 35th International Conference on Software Engineering (ICSE), pp. 712–721. IEEE, San Francisco, CA, USA (2013). https://doi.org/10.1109/ ICSE.2013.6606617

arXiv 2013