pith. machine review for the scientific record.

arxiv: 2603.26567 · v2 · submitted 2026-03-27 · 💻 cs.SE · cs.AI

Recognition: no theorem link

Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 22:20 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords repository-level QA · LLM evaluation · program comprehension · StackRepoQA · retrieval-augmented generation · memorization · software engineering benchmarks · Java projects

The pith

LLMs achieve only moderate accuracy on real repository-level questions, frequently by reproducing Stack Overflow answers verbatim rather than reasoning through code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds StackRepoQA, a dataset of 1,318 actual developer questions and answers drawn from 134 open-source Java projects, to move beyond single-file benchmarks and test how LLMs handle questions that span entire repositories and their dependencies. The authors evaluate Claude 3.5 Sonnet and GPT-4o using direct prompting and agentic setups, then compare against retrieval methods that supply file contents or dependency graphs. Baseline results are moderate and rise modestly when structural information is added, yet overall performance stays low for repository-scale tasks. The study finds that many correct outputs match existing Stack Overflow answers exactly, indicating that success often comes from recall instead of comprehension of the code structure.

Core claim

Using the StackRepoQA dataset constructed from real developer questions and accepted answers across multiple Java projects, the evaluation shows that LLMs reach moderate accuracy on repository-level QA tasks. Performance improves when retrieval-augmented generation incorporates file-level signals and graph representations of structural dependencies, but accuracy remains limited overall. High scores frequently result from verbatim reproduction of Stack Overflow content rather than genuine reasoning about code across files.

What carries the argument

The StackRepoQA dataset of real multi-project questions paired with retrieval-augmented generation that adds file retrieval and graph-based dependency structures to LLM prompts.
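
To make that machinery concrete, here is a minimal sketch of how file-level retrieval and a dependency graph might be folded into a single repository-level prompt. The function names, the retrieval interface, and the prompt layout are assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal sketch of repository-level retrieval-augmented prompting.
# All names (RetrievedFile, format_dependency_edges, build_prompt) are
# hypothetical; the paper's implementation may differ.
from dataclasses import dataclass

@dataclass
class RetrievedFile:
    path: str
    content: str
    score: float  # similarity of the file to the question

def format_dependency_edges(edges: list[tuple[str, str]]) -> str:
    # Render structural dependencies (e.g. "A.java -> B.java") as plain text
    # so they can sit in the prompt alongside retrieved file contents.
    return "\n".join(f"{src} -> {dst}" for src, dst in edges)

def build_prompt(question: str,
                 files: list[RetrievedFile],
                 dep_edges: list[tuple[str, str]],
                 top_k: int = 5) -> str:
    # Keep only the top-k retrieved files to stay within the context window.
    kept = sorted(files, key=lambda f: f.score, reverse=True)[:top_k]
    file_blocks = "\n\n".join(f"// {f.path}\n{f.content}" for f in kept)
    return (
        "You are answering a question about a Java repository.\n\n"
        f"Structural dependencies:\n{format_dependency_edges(dep_edges)}\n\n"
        f"Relevant files:\n{file_blocks}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The design choice at issue in the paper is whether the dependency-edge block adds anything beyond the retrieved files themselves; the reported gains from such structural signals are modest.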

If this is right

  • Adding file-level retrieval and dependency graphs produces only modest gains in accuracy for repository-scale questions.
  • Many high-performing responses trace directly to verbatim matches with existing Stack Overflow answers.
  • Repository-level comprehension stays limited even for current frontier models under both direct and agentic prompting.
  • Benchmarks must incorporate controls to separate memorization from reasoning in software engineering tasks.
  • Releasing the dataset supports development of new evaluation methods and augmentation techniques for multi-file code understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Paraphrased or novel questions without close training-data matches would likely expose even lower true reasoning performance.
  • The pattern of memorization may generalize to other languages and question types beyond the Java projects tested here.
  • Integrating execution traces or runtime verification alongside retrieval could better distinguish surface recall from functional understanding.
  • Future agent designs may need deeper code analysis modules rather than simple file or graph retrieval to reach reliable repository-scale performance.

Load-bearing premise

Exact string matches between model outputs and Stack Overflow answers reliably indicate memorization rather than any form of understanding or coincidental overlap.
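
A minimal sketch of what this premise treats as evidence of memorization: a normalized string comparison between a model output and a candidate Stack Overflow answer. The normalization choices (lower-casing, whitespace collapsing) are assumptions; the paper does not specify its matching algorithm, which is precisely the referee's main objection below.

```python
import re

def normalize(text: str) -> str:
    # Lower-case and collapse all whitespace so trivial formatting
    # differences do not mask an otherwise verbatim reproduction.
    return re.sub(r"\s+", " ", text.lower()).strip()

def is_verbatim_match(model_answer: str, stackoverflow_answer: str) -> bool:
    # The load-bearing premise: a normalized exact match is read as
    # memorization rather than coincidental overlap or paraphrase.
    return normalize(model_answer) == normalize(stackoverflow_answer)
```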

What would settle it

Re-evaluate the same models on versions of the questions that have been paraphrased or rewritten so that they have no direct Stack Overflow equivalents, and check whether accuracy drops sharply.
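
A minimal sketch of that settling experiment, assuming a model interface, a grader, and paraphrased copies of each question already exist; all of these names are hypothetical.

```python
def accuracy_drop(model, questions, paraphrased, score_answer) -> float:
    # questions[i] and paraphrased[i] are the original and rewritten forms of
    # the same developer question; score_answer(question, answer) returns 1
    # if the model's answer is judged correct, 0 otherwise.
    orig = [score_answer(q, model.answer(q)) for q in questions]
    para = [score_answer(q, model.answer(p)) for q, p in zip(questions, paraphrased)]
    acc_orig = sum(orig) / len(orig)
    acc_para = sum(para) / len(para)
    # A sharp positive drop would support the memorization interpretation.
    return acc_orig - acc_para
```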

Figures

Figures reproduced from arXiv: 2603.26567 by Chris Brown, Hunter Leary, Swanand Vaishampayan, Yoseph Berhanu Alebachew.

Figure 1: Overview of the data collection and preprocessing pipeline. view at source ↗
Figure 2: Overview of the multi-agent architecture used in our evaluation. view at source ↗
Original abstract

Large Language Models (LLMs) have shown impressive capabilities across software engineering tasks, including question answering (QA). However, most studies and benchmarks focus on isolated functions or single-file snippets, overlooking the challenges of real-world program comprehension, which often spans multiple files and system-level dependencies. In this work, we introduce StackRepoQA, the first multi-project, repository-level question answering dataset constructed from 1,318 real developer questions and accepted answers across 134 open-source Java projects. Using this dataset, we systematically evaluate two widely used LLMs (Claude 3.5 Sonnet and GPT-4o) under both direct prompting and agentic configurations. We compare baseline performance with retrieval-augmented generation methods that leverage file-level retrieval and graph-based representations of structural dependencies. Our results show that LLMs achieve moderate accuracy at baseline, with performance improving when structural signals are incorporated. Nonetheless, overall accuracy remains limited for repository-scale comprehension. The analysis reveals that high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning. To our knowledge, this is the first empirical study to provide such evidence in repository-level QA. We release StackRepoQA to encourage further research into benchmarks, evaluation protocols, and augmentation strategies that disentangle memorization from reasoning, advancing LLMs as reliable tools for repository-scale program comprehension.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces StackRepoQA, a new dataset of 1,318 real developer questions and accepted answers drawn from 134 open-source Java repositories. It evaluates Claude 3.5 Sonnet and GPT-4o under direct prompting, agentic setups, and retrieval-augmented generation that incorporates file-level retrieval and graph-based structural dependencies. The central claims are that baseline accuracy is moderate and improves modestly with structural signals, yet remains limited overall, and that high scores frequently arise from verbatim reproduction of Stack Overflow answers rather than genuine repository-level reasoning.

Significance. If the core findings hold after addressing the detection methodology, the work would be significant for the software engineering and LLM evaluation communities. It provides the first multi-project repository-level QA benchmark and supplies concrete evidence that current performance metrics may be inflated by memorization, which could motivate new protocols that better separate retrieval of memorized content from actual program comprehension. The public release of StackRepoQA is a clear positive contribution.

major comments (1)
  1. [Analysis] Analysis section: The headline claim that 'high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning' rests on the verbatim-detection procedure. The manuscript provides no description of the exact matching algorithm (exact string match, normalized edit distance, or embedding threshold), no controls for paraphrasing or partial semantic overlap (e.g., BLEU, ROUGE, or cosine similarity on sentence embeddings), and no inter-annotator agreement or human validation of the memorization labels. Because this distinction is load-bearing for the interpretation that overall accuracy reflects limited genuine comprehension, the current evidence does not securely support the conclusion.
minor comments (2)
  1. [Abstract] Abstract and §3: The description of how questions were filtered and how accuracy was computed (exact match, F1, or human judgment) is absent; adding a short paragraph on these operational definitions would improve reproducibility.
  2. [Experiments] §4 (Experiments): The paper should report the precise retrieval hyperparameters (top-k, embedding model, graph construction details) and statistical significance tests for the reported accuracy improvements under structural RAG.
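
For the significance testing the second minor comment asks for, one possibility is a paired bootstrap over per-question correctness under the baseline and structural-RAG conditions. This is a sketch of one such test, not a procedure the paper prescribes.

```python
import random

def paired_bootstrap_p(baseline: list[int], rag: list[int],
                       n_resamples: int = 10_000) -> float:
    # baseline[i] and rag[i] are 0/1 correctness for the same question under
    # the two conditions. Estimate how often a resampled accuracy difference
    # is <= 0, i.e. how plausibly the observed RAG improvement is chance.
    assert len(baseline) == len(rag)
    n = len(baseline)
    worse = 0
    for _ in range(n_resamples):
        idx = [random.randrange(n) for _ in range(n)]
        diff = sum(rag[i] for i in idx) / n - sum(baseline[i] for i in idx) / n
        if diff <= 0:
            worse += 1
    return worse / n_resamples
```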

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential significance of StackRepoQA. We address the single major comment below and will revise the manuscript to strengthen the analysis.

Point-by-point responses
  1. Referee: Analysis section: The headline claim that 'high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning' rests on the verbatim-detection procedure. The manuscript provides no description of the exact matching algorithm (exact string match, normalized edit distance, or embedding threshold), no controls for paraphrasing or partial semantic overlap (e.g., BLEU, ROUGE, or cosine similarity on sentence embeddings), and no inter-annotator agreement or human validation of the memorization labels. Because this distinction is load-bearing for the interpretation that overall accuracy reflects limited genuine comprehension, the current evidence does not securely support the conclusion.

    Authors: We agree that the current manuscript lacks sufficient detail on the verbatim-detection procedure, which weakens support for the interpretation. In the revision we will add a dedicated subsection in the Analysis section that specifies the exact algorithm (normalized exact string match after lower-casing and whitespace removal, with a 90 % overlap threshold for partial matches), reports ROUGE-1/2/L and sentence-embedding cosine similarity as controls for paraphrasing, and includes human validation results on a random sample of 200 high-scoring cases together with inter-annotator agreement (Cohen’s kappa). These additions will make the evidence for the memorization claim transparent and reproducible. revision: yes
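
A minimal sketch of the controls the rebuttal promises, assuming binary memorization labels from two human annotators; the 90% threshold would be applied to a partial-overlap score such as the one below. This is an illustration under those assumptions, not the authors' actual code.

```python
from collections import Counter

def unigram_overlap(candidate: str, reference: str) -> float:
    # A ROUGE-1-recall-style control: fraction of reference tokens that also
    # appear in the candidate, as a rough check for partial copying.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matched = sum(min(cand[tok], n) for tok, n in ref.items())
    return matched / max(sum(ref.values()), 1)

def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    # Agreement between two annotators labelling outputs as memorized (1)
    # vs not (0), corrected for chance agreement.
    n = len(labels_a)
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a1, p_b1 = sum(labels_a) / n, sum(labels_b) / n
    p_chance = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_obs - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0
```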

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

Full rationale

The paper introduces a new dataset (StackRepoQA) constructed from 1,318 real developer questions across 134 Java projects and evaluates external LLMs (Claude 3.5 Sonnet, GPT-4o) under direct prompting, agentic setups, and RAG variants. No equations, fitted parameters, or self-citations appear in the derivation chain. The claim that high scores often result from verbatim Stack Overflow reproduction is an empirical observation on the newly collected data rather than a reduction of a prediction to a fitted input or self-defined quantity. The evaluation relies on external models and benchmarks, satisfying the self-contained criterion with no load-bearing step that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on two domain assumptions: that Stack Overflow accepted answers constitute ground truth for developer intent, and that exact string overlap is a valid proxy for memorization. No free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Stack Overflow accepted answers represent reliable ground truth for repository-level questions
    Used to label the 1,318 questions as correct references
  • domain assumption Exact string match to training data indicates lack of genuine reasoning
    Central to the memorization analysis claim

pith-pipeline@v0.9.0 · 5551 in / 1285 out tokens · 22456 ms · 2026-05-14T22:20:20.777947+00:00 · methodology

discussion (0)

