
arxiv: 2606.08500 · v1 · pith:4TMSEGNKnew · submitted 2026-06-07 · 💻 cs.SE · cs.AI

Projecting the Emerging Mindset of SWE Agent by Launching a Wild Code Understanding Journey

Pith reviewed 2026-06-27 18:16 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords SWE agents · code understanding · agent trajectories · observation lenses · repository exploration · behavioral profiles · tool-mediated agents · epistemic grounding

The pith

SWE-agent trajectories become comparable behavioral profiles when read through five observation lenses on navigation, evidence, synthesis, grounding, and stopping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Ada, an apparatus that explores real code repositories through a bounded set of tools while recording every think-action step as a finite trajectory. Five observation lenses turn those raw traces into visible records of how the agent chooses where to look, what evidence to trust, when to consolidate understanding, and when to stop. Across 408 trajectories collected from multiple models, repositories, and task conditions, the lenses produce disciplined profiles of agent behavior without collapsing to tool counts or guessing at unrecorded intent. The resulting profiles reveal measurable differences in efficiency, diversity of paths, and epistemic grounding. The work supplies a repeatable method for turning faithful digital traces into projections of an emerging SWE-agent mindset.

Core claim

Ada enters real codebases through a bounded tool interface, allowing open-ended exploration to remain recordable as finite trajectories. We project Ada's think-action chains through observation lenses that make navigation, evidence selection, synthesis, grounding, and stopping visible without reducing behavior to raw tool counts or speculating about hidden intent. Read together, these lenses produce behavioral profiles grounded in recorded movement through software worlds. Across 408 trajectories, spanning multiple models, repositories, task families, and launch conditions, the study shows how faithful digital traces can be transformed into disciplined, comparable projections of emerging SWE-agent mindset.

What carries the argument

Ada, a scoped apparatus that explores repositories via a bounded tool interface and projects think-action trajectories through five observation lenses (navigation, evidence selection, synthesis, grounding, stopping) to generate comparable behavioral profiles.
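
To make "bounded tool interface" and "finite trajectory" concrete, here is a minimal sketch of how such a record could be represented. It is not Ada's implementation: the tool names, the `Step` and `Trajectory` classes, and the step budget are all assumptions chosen for illustration, since the abstract does not list them.

```python
from dataclasses import dataclass, field

# Hypothetical tool names: the abstract does not enumerate Ada's actual tools.
ALLOWED_TOOLS = frozenset({"list_dir", "read_file", "search", "finish"})


@dataclass(frozen=True)
class Step:
    """One think-action pair as it would appear in a recorded trace."""
    thought: str
    tool: str
    argument: str = ""


@dataclass
class Trajectory:
    """A finite, append-only record of steps under an assumed step budget."""
    max_steps: int = 50  # assumed bound, not a figure from the paper
    steps: list = field(default_factory=list)

    @property
    def closed(self) -> bool:
        # The trace counts as closed once the agent itself chose to stop.
        return bool(self.steps) and self.steps[-1].tool == "finish"

    def record(self, thought: str, tool: str, argument: str = "") -> Step:
        if tool not in ALLOWED_TOOLS:
            raise ValueError(f"tool outside the bounded interface: {tool!r}")
        if self.closed:
            raise RuntimeError("trajectory already closed by a finish step")
        if len(self.steps) >= self.max_steps:
            raise RuntimeError("step budget exhausted")
        step = Step(thought, tool, argument)
        self.steps.append(step)
        return step


trace = Trajectory(max_steps=3)
trace.record("Start from the top-level layout.", "list_dir", ".")
trace.record("The README should state the entry point.", "read_file", "README.md")
trace.record("Enough evidence to close the account.", "finish")
```

The point of the sketch is only that exploration can be open-ended in content while the record stays finite and replayable: every step passes through a fixed tool set, and the trace ends either by a self-directed stop or by the budget.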

If this is right

  • Efficiency differences across models become measurable through the same lens-derived profiles rather than post-hoc inspection.
  • Trajectory diversity and the degree of epistemic grounding can be compared directly across launch conditions.
  • Limits on how much external intervention can alter agent stopping behavior become observable in the profiles.
  • The method supplies a repeatable foundation for studying SWE-agent behavior inside actual codebases instead of toy environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The lens approach could scale to automated scoring of thousands of agent runs if the five categories are encoded as classifiers.
  • Profiles might later be used to diagnose why one agent succeeds on a task while another fails by tracing differences in grounding steps.
  • The bounded-interface design suggests a template for other domains where agents must explore large state spaces without unbounded tool access.

Load-bearing premise

The five observation lenses can render agent behavior visible and comparable without collapsing it into raw tool counts or requiring guesses about hidden intent.

What would settle it

Apply the same five lenses to a fresh set of trajectories from a different model or repository and check whether the resulting profiles lose all ability to distinguish models or task conditions while still matching independent performance metrics.

original abstract

Software engineering agents (SWE agents) increasingly work through tool-mediated trajectories in real repositories, yet their behavior remains difficult to characterize in concrete, observable terms. These trajectories record tool use, intermediate reasoning, evidence selection, and self-directed stopping, but they do not by themselves explain why particular moves were chosen, what evidence was trusted, or when understanding was judged sufficient. This tension makes trajectory data both limited and valuable: faithful, replayable traces can become an empirical substrate for studying agent behavior when interpreted through disciplined observation. We introduce Ada, a scoped apparatus for repository-level code understanding. Ada enters real codebases through a bounded tool interface, allowing open-ended exploration to remain recordable as finite trajectories. Across this wild-but-bounded setting, Ada chooses where to look, what to read closely, when to consolidate partial understanding, and when to close its account of the repository. We project Ada's think-action chains through observation lenses that make navigation, evidence selection, synthesis, grounding, and stopping visible without reducing behavior to raw tool counts or speculating about hidden intent. Read together, these lenses produce behavioral profiles grounded in recorded movement through software worlds. Across 408 trajectories, spanning multiple models, repositories, task families, and launch conditions, the study shows how faithful digital traces can be transformed into disciplined, comparable projections of emerging SWE-agent mindset. The results expose differences in efficiency, trajectory diversity, epistemic grounding, and the limits of intervention, while providing a methodological foundation for observing SWE agent behavior in real codebases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Ada, an apparatus for repository-level code understanding that enters real codebases via a bounded tool interface to generate finite, recordable trajectories. It applies five observation lenses (navigation, evidence selection, synthesis, grounding, stopping) to 408 trajectories spanning multiple models, repositories, task families, and launch conditions, claiming these lenses transform the trajectories into disciplined, comparable projections of emerging SWE-agent mindset without reducing behavior to raw tool counts or speculating on hidden intent. The results are said to expose differences in efficiency, trajectory diversity, epistemic grounding, and intervention limits while providing a methodological foundation.

Significance. If the lenses can be shown to be mechanically applicable and independent of coder judgment, the approach could supply a useful empirical substrate for characterizing tool-mediated agent behavior in software repositories, moving beyond raw logs toward observable behavioral profiles. The scale of 408 trajectories across varied conditions is a potential strength for comparability claims.

major comments (3)
  1. [section on observation lenses] The section introducing the five observation lenses provides no explicit operational definitions, decision procedures, or mechanical mapping rules (e.g., regex on think-action logs or decision trees) for classifying segments as navigation vs. evidence selection vs. synthesis vs. grounding vs. stopping. This is load-bearing for the central claim that the projections are 'disciplined' and avoid speculation on hidden intent; without such rules the profiles remain sensitive to interpretive choices and cross-trajectory comparability cannot be verified.
  2. [section describing the 408 trajectories] The section describing the 408 trajectories does not specify collection protocol, selection criteria, exact models/repositories/task families/launch conditions, or any filtering steps. This directly undermines the claim that the study spans multiple conditions and yields generalizable, comparable mindset projections grounded in recorded movement.
  3. [validation or results section] No validation steps (inter-rater reliability, comparison of lens outputs to raw trajectory data, or sensitivity analysis) are reported for the application of the lenses. This is load-bearing because the weakest assumption—that the lenses make behavior visible without reducing to tool counts or requiring speculation—cannot be assessed without evidence that the mappings are reproducible.
minor comments (1)
  1. [abstract] The abstract is overly dense; separating the apparatus description, lens definitions, and empirical claims into distinct sentences would improve readability.
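
The first major comment asks for mechanical mapping rules such as regexes over think-action logs. The sketch below shows what one such rule set might look like. Every rule here is hypothetical: the tool names, the `note` pseudo-tool, and the text cues are invented for illustration and are not the paper's definitions. The sketch is also deliberately crude, since it keys mostly on tool names, which is closer to the tool-count reduction the paper says its lenses avoid.

```python
import re
from collections import Counter

# Hypothetical rules keyed on the tool name of a step (checked first, in order).
TOOL_RULES = [
    ("stopping", re.compile(r"finish|submit")),
    ("navigation", re.compile(r"list_dir|search|find")),
    ("evidence selection", re.compile(r"read_file|open")),
]

# Hypothetical text cues, applied only to steps no tool rule claimed.
SYNTHESIS_CUE = re.compile(r"\b(so far|in summary|overall)\b", re.IGNORECASE)
# A thought that cites a concrete file path is treated as grounded in the repo.
GROUNDING_CUE = re.compile(r"[\w./-]+\.(py|js|ts|go|rs|java|md)\b")


def tag_step(tool: str, thought: str) -> str:
    """Assign exactly one lens label to a step using only logged features."""
    for lens, pattern in TOOL_RULES:
        if pattern.fullmatch(tool):
            return lens
    if SYNTHESIS_CUE.search(thought):
        return "synthesis"
    if GROUNDING_CUE.search(thought):
        return "grounding"
    return "unclassified"


def profile(steps) -> Counter:
    """Count lens labels over a trajectory given as (tool, thought) pairs."""
    return Counter(tag_step(tool, thought) for tool, thought in steps)


steps = [
    ("list_dir", "Look at the top-level layout."),
    ("read_file", "Open src/app.py to check the entry point."),
    ("note", "In summary, the CLI wraps the parser."),
    ("note", "This matches the call in src/app.py."),
    ("finish", "Done."),
]
labels = [tag_step(tool, thought) for tool, thought in steps]
```

Because the rules read only the recorded tool name and thought text, two coders running them on the same log get the same labels, which is the property the referee says the manuscript cannot currently demonstrate. Whether rules this simple capture what the lenses are meant to show is a separate question.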

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and for identifying specific gaps that affect the reproducibility and verifiability of our central claims. We address each major comment below and commit to revisions that directly respond to the concerns raised.

point-by-point responses
  1. Referee: [section on observation lenses] The section introducing the five observation lenses provides no explicit operational definitions, decision procedures, or mechanical mapping rules (e.g., regex on think-action logs or decision trees) for classifying segments as navigation vs. evidence selection vs. synthesis vs. grounding vs. stopping. This is load-bearing for the central claim that the projections are 'disciplined' and avoid speculation on hidden intent; without such rules the profiles remain sensitive to interpretive choices and cross-trajectory comparability cannot be verified.

    Authors: We agree that the absence of explicit mechanical mapping rules weakens the claim that the lenses produce disciplined, comparable projections. In the revised manuscript we will insert a new subsection that supplies operational definitions and decision procedures for each lens, including pattern-matching rules on think-action logs and decision trees that map observable log features to the five categories without reference to inferred intent. revision: yes

  2. Referee: [section describing the 408 trajectories] The section describing the 408 trajectories does not specify collection protocol, selection criteria, exact models/repositories/task families/launch conditions, or any filtering steps. This directly undermines the claim that the study spans multiple conditions and yields generalizable, comparable mindset projections grounded in recorded movement.

    Authors: We accept that the current description is insufficient for reproducibility. The revised manuscript will expand the trajectories section with a dedicated protocol subsection that enumerates the exact models, repositories, task families, launch conditions, collection procedure, selection criteria, and any filtering applied to arrive at the 408 trajectories. revision: yes

  3. Referee: [validation or results section] No validation steps (inter-rater reliability, comparison of lens outputs to raw trajectory data, or sensitivity analysis) are reported for the application of the lenses. This is load-bearing because the weakest assumption—that the lenses make behavior visible without reducing to tool counts or requiring speculation—cannot be assessed without evidence that the mappings are reproducible.

    Authors: We acknowledge the lack of reported validation. We will add a validation subsection that reports inter-rater reliability on a sampled subset of trajectories, quantitative comparison of lens outputs against raw logs, and sensitivity analyses under varied mapping thresholds to demonstrate that the lens applications are reproducible. revision: yes
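
The promised inter-rater reliability check is commonly reported as Cohen's kappa, which corrects raw agreement between two coders for agreement expected by chance. The rebuttal does not say which statistic the authors will use, so the choice of kappa, the function name, and the example labels below are assumptions; this is a sketch of one standard option.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two coders labelling the same sequence of steps."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty sequence")
    n = len(labels_a)
    # Observed agreement: fraction of steps given the same lens by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each coder's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both coders used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels for four steps, for illustration only.
coder_a = ["navigation", "navigation", "grounding", "grounding"]
coder_b = ["navigation", "grounding", "grounding", "grounding"]
kappa = cohens_kappa(coder_a, coder_b)  # observed 0.75, expected 0.5, kappa 0.5
```

With five lens categories and uneven label frequencies, raw percent agreement can look high by chance alone, which is why a chance-corrected statistic is the more informative number to report.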

Circularity Check

0 steps flagged

No circularity; methodological description is self-contained

full rationale

The paper presents a qualitative apparatus (Ada) and five observation lenses applied to recorded trajectories. No equations, fitted parameters, or derivations appear in the abstract or described structure. The central claim—that lenses produce comparable projections from think-action chains without reducing to tool counts or speculating on intent—is presented as a definitional methodological choice rather than a reduction to prior inputs or self-citations. No load-bearing step reduces by construction to its own outputs, and the work contains no self-citation chains or uniqueness theorems. This is the expected non-finding for a descriptive empirical study without quantitative modeling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the introduction of the Ada apparatus itself.

invented entities (1)
  • Ada (no independent evidence)
    purpose: scoped apparatus for repository-level code understanding via bounded tool interface and observation lenses
    New system introduced in the abstract to enable recordable open-ended exploration.

pith-pipeline@v0.9.1-grok · 5804 in / 1169 out tokens · 19064 ms · 2026-06-27T18:16:50.758409+00:00 · methodology

