pith. machine review for the scientific record.

arxiv: 2603.26567 · v2 · submitted 2026-03-27 · 💻 cs.SE · cs.AI

Recognition: no theorem link

Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 22:20 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords repository-level QA · LLM evaluation · program comprehension · StackRepoQA · retrieval-augmented generation · memorization · software engineering benchmarks · Java projects

The pith

LLMs achieve only moderate accuracy on real repository-level questions, frequently by reproducing Stack Overflow answers verbatim rather than reasoning through code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds StackRepoQA, a dataset of 1,318 actual developer questions and answers drawn from 134 open-source Java projects, to move beyond single-file benchmarks and test how LLMs handle questions that span entire repositories and their dependencies. The authors evaluate Claude 3.5 Sonnet and GPT-4o using direct prompting and agentic setups, then compare against retrieval methods that supply file contents or dependency graphs. Baseline results are moderate and rise modestly when structural information is added, yet overall performance stays low for repository-scale tasks. The study finds that many correct outputs match existing Stack Overflow answers exactly, indicating that success often comes from recall instead of comprehension of the code structure.

Core claim

Using the StackRepoQA dataset constructed from real developer questions and accepted answers across multiple Java projects, the evaluation shows that LLMs reach moderate accuracy on repository-level QA tasks. Performance improves when retrieval-augmented generation incorporates file-level signals and graph representations of structural dependencies, but accuracy remains limited overall. High scores frequently result from verbatim reproduction of Stack Overflow content rather than genuine reasoning about code across files.

What carries the argument

The StackRepoQA dataset of real multi-project questions paired with retrieval-augmented generation that adds file retrieval and graph-based dependency structures to LLM prompts.
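
To make that machinery concrete, here is a minimal sketch of how file-level retrieval and a dependency graph might be folded into a single repository-level prompt. The function names, the retrieval interface, and the prompt layout are assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal sketch of repository-level retrieval-augmented prompting.
# All names (RetrievedFile, format_dependency_edges, build_prompt) are
# hypothetical; the paper's implementation may differ.
from dataclasses import dataclass

@dataclass
class RetrievedFile:
    path: str
    content: str
    score: float  # similarity of the file to the question

def format_dependency_edges(edges: list[tuple[str, str]]) -> str:
    # Render structural dependencies (e.g. "A.java -> B.java") as plain text
    # so they can sit in the prompt alongside retrieved file contents.
    return "\n".join(f"{src} -> {dst}" for src, dst in edges)

def build_prompt(question: str,
                 files: list[RetrievedFile],
                 dep_edges: list[tuple[str, str]],
                 top_k: int = 5) -> str:
    # Keep only the top-k retrieved files to stay within the context window.
    kept = sorted(files, key=lambda f: f.score, reverse=True)[:top_k]
    file_blocks = "\n\n".join(f"// {f.path}\n{f.content}" for f in kept)
    return (
        "You are answering a question about a Java repository.\n\n"
        f"Structural dependencies:\n{format_dependency_edges(dep_edges)}\n\n"
        f"Relevant files:\n{file_blocks}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The design choice at issue in the paper is whether the dependency-edge block adds anything beyond the retrieved files themselves; the reported gains from such structural signals are modest.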

If this is right

  • Adding file-level retrieval and dependency graphs produces only modest gains in accuracy for repository-scale questions.
  • Many high-performing responses trace directly to verbatim matches with existing Stack Overflow answers.
  • Repository-level comprehension stays limited even for current frontier models under both direct and agentic prompting.
  • Benchmarks must incorporate controls to separate memorization from reasoning in software engineering tasks.
  • Releasing the dataset supports development of new evaluation methods and augmentation techniques for multi-file code understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Paraphrased or novel questions without close training-data matches would likely expose even lower true reasoning performance.
  • The pattern of memorization may generalize to other languages and question types beyond the Java projects tested here.
  • Integrating execution traces or runtime verification alongside retrieval could better distinguish surface recall from functional understanding.
  • Future agent designs may need deeper code analysis modules rather than simple file or graph retrieval to reach reliable repository-scale performance.

Load-bearing premise

Exact string matches between model outputs and Stack Overflow answers reliably indicate memorization rather than any form of understanding or coincidental overlap.
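
A minimal sketch of what this premise treats as evidence of memorization: a normalized string comparison between a model output and a candidate Stack Overflow answer. The normalization choices (lower-casing, whitespace collapsing) are assumptions; the paper does not specify its matching algorithm, which is precisely the referee's main objection below.

```python
import re

def normalize(text: str) -> str:
    # Lower-case and collapse all whitespace so trivial formatting
    # differences do not mask an otherwise verbatim reproduction.
    return re.sub(r"\s+", " ", text.lower()).strip()

def is_verbatim_match(model_answer: str, stackoverflow_answer: str) -> bool:
    # The load-bearing premise: a normalized exact match is read as
    # memorization rather than coincidental overlap or paraphrase.
    return normalize(model_answer) == normalize(stackoverflow_answer)
```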

What would settle it

Re-evaluate the same models on versions of the questions that have been paraphrased or rewritten so that they have no direct Stack Overflow equivalents, and check whether accuracy drops sharply.
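
A minimal sketch of that settling experiment, assuming a model interface, a grader, and paraphrased copies of each question already exist; all of these names are hypothetical.

```python
def accuracy_drop(model, questions, paraphrased, score_answer) -> float:
    # questions[i] and paraphrased[i] are the original and rewritten forms of
    # the same developer question; score_answer(question, answer) returns 1
    # if the model's answer is judged correct, 0 otherwise.
    orig = [score_answer(q, model.answer(q)) for q in questions]
    para = [score_answer(q, model.answer(p)) for q, p in zip(questions, paraphrased)]
    acc_orig = sum(orig) / len(orig)
    acc_para = sum(para) / len(para)
    # A sharp positive drop would support the memorization interpretation.
    return acc_orig - acc_para
```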

Figures

Figures reproduced from arXiv: 2603.26567 by Chris Brown, Hunter Leary, Swanand Vaishampayan, Yoseph Berhanu Alebachew.

Figure 1: Overview of the data collection and preprocessing pipeline. view at source ↗
Figure 2: Overview of the multi-agent architecture used in our evaluation. view at source ↗
Original abstract

Large Language Models (LLMs) have shown impressive capabilities across software engineering tasks, including question answering (QA). However, most studies and benchmarks focus on isolated functions or single-file snippets, overlooking the challenges of real-world program comprehension, which often spans multiple files and system-level dependencies. In this work, we introduce StackRepoQA, the first multi-project, repository-level question answering dataset constructed from 1,318 real developer questions and accepted answers across 134 open-source Java projects. Using this dataset, we systematically evaluate two widely used LLMs (Claude 3.5 Sonnet and GPT-4o) under both direct prompting and agentic configurations. We compare baseline performance with retrieval-augmented generation methods that leverage file-level retrieval and graph-based representations of structural dependencies. Our results show that LLMs achieve moderate accuracy at baseline, with performance improving when structural signals are incorporated. Nonetheless, overall accuracy remains limited for repository-scale comprehension. The analysis reveals that high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning. To our knowledge, this is the first empirical study to provide such evidence in repository-level QA. We release StackRepoQA to encourage further research into benchmarks, evaluation protocols, and augmentation strategies that disentangle memorization from reasoning, advancing LLMs as reliable tools for repository-scale program comprehension.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces StackRepoQA, a new dataset of 1,318 real developer questions and accepted answers drawn from 134 open-source Java repositories. It evaluates Claude 3.5 Sonnet and GPT-4o under direct prompting, agentic setups, and retrieval-augmented generation that incorporates file-level retrieval and graph-based structural dependencies. The central claims are that baseline accuracy is moderate and improves modestly with structural signals, yet remains limited overall, and that high scores frequently arise from verbatim reproduction of Stack Overflow answers rather than genuine repository-level reasoning.

Significance. If the core findings hold after addressing the detection methodology, the work would be significant for the software engineering and LLM evaluation communities. It provides the first multi-project repository-level QA benchmark and supplies concrete evidence that current performance metrics may be inflated by memorization, which could motivate new protocols that better separate retrieval of memorized content from actual program comprehension. The public release of StackRepoQA is a clear positive contribution.

major comments (1)
  1. [Analysis] Analysis section: The headline claim that 'high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning' rests on the verbatim-detection procedure. The manuscript provides no description of the exact matching algorithm (exact string match, normalized edit distance, or embedding threshold), no controls for paraphrasing or partial semantic overlap (e.g., BLEU, ROUGE, or cosine similarity on sentence embeddings), and no inter-annotator agreement or human validation of the memorization labels. Because this distinction is load-bearing for the interpretation that overall accuracy reflects limited genuine comprehension, the current evidence does not securely support the conclusion.
minor comments (2)
  1. [Abstract] Abstract and §3: The description of how questions were filtered and how accuracy was computed (exact match, F1, or human judgment) is absent; adding a short paragraph on these operational definitions would improve reproducibility.
  2. [Experiments] §4 (Experiments): The paper should report the precise retrieval hyperparameters (top-k, embedding model, graph construction details) and statistical significance tests for the reported accuracy improvements under structural RAG.
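
For the significance testing the second minor comment asks for, one possibility is a paired bootstrap over per-question correctness under the baseline and structural-RAG conditions. This is a sketch of one such test, not a procedure the paper prescribes.

```python
import random

def paired_bootstrap_p(baseline: list[int], rag: list[int],
                       n_resamples: int = 10_000) -> float:
    # baseline[i] and rag[i] are 0/1 correctness for the same question under
    # the two conditions. Estimate how often a resampled accuracy difference
    # is <= 0, i.e. how plausibly the observed RAG improvement is chance.
    assert len(baseline) == len(rag)
    n = len(baseline)
    worse = 0
    for _ in range(n_resamples):
        idx = [random.randrange(n) for _ in range(n)]
        diff = sum(rag[i] for i in idx) / n - sum(baseline[i] for i in idx) / n
        if diff <= 0:
            worse += 1
    return worse / n_resamples
```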

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential significance of StackRepoQA. We address the single major comment below and will revise the manuscript to strengthen the analysis.

Point-by-point responses
  1. Referee: Analysis section: The headline claim that 'high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning' rests on the verbatim-detection procedure. The manuscript provides no description of the exact matching algorithm (exact string match, normalized edit distance, or embedding threshold), no controls for paraphrasing or partial semantic overlap (e.g., BLEU, ROUGE, or cosine similarity on sentence embeddings), and no inter-annotator agreement or human validation of the memorization labels. Because this distinction is load-bearing for the interpretation that overall accuracy reflects limited genuine comprehension, the current evidence does not securely support the conclusion.

    Authors: We agree that the current manuscript lacks sufficient detail on the verbatim-detection procedure, which weakens support for the interpretation. In the revision we will add a dedicated subsection in the Analysis section that specifies the exact algorithm (normalized exact string match after lower-casing and whitespace removal, with a 90 % overlap threshold for partial matches), reports ROUGE-1/2/L and sentence-embedding cosine similarity as controls for paraphrasing, and includes human validation results on a random sample of 200 high-scoring cases together with inter-annotator agreement (Cohen’s kappa). These additions will make the evidence for the memorization claim transparent and reproducible. revision: yes
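
A minimal sketch of the controls the rebuttal promises, assuming binary memorization labels from two human annotators; the 90% threshold would be applied to a partial-overlap score such as the one below. This is an illustration under those assumptions, not the authors' actual code.

```python
from collections import Counter

def unigram_overlap(candidate: str, reference: str) -> float:
    # A ROUGE-1-recall-style control: fraction of reference tokens that also
    # appear in the candidate, as a rough check for partial copying.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matched = sum(min(cand[tok], n) for tok, n in ref.items())
    return matched / max(sum(ref.values()), 1)

def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    # Agreement between two annotators labelling outputs as memorized (1)
    # vs not (0), corrected for chance agreement.
    n = len(labels_a)
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a1, p_b1 = sum(labels_a) / n, sum(labels_b) / n
    p_chance = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_obs - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0
```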

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

Full rationale

The paper introduces a new dataset (StackRepoQA) constructed from 1,318 real developer questions across 134 Java projects and evaluates external LLMs (Claude 3.5 Sonnet, GPT-4o) under direct prompting, agentic setups, and RAG variants. No equations, fitted parameters, or self-citations appear in the derivation chain. The claim that high scores often result from verbatim Stack Overflow reproduction is an empirical observation on the newly collected data rather than a reduction of a prediction to a fitted input or self-defined quantity. The evaluation relies on external models and benchmarks, satisfying the self-contained criterion with no load-bearing step that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on two domain assumptions: that Stack Overflow accepted answers constitute ground truth for developer intent, and that exact string overlap is a valid proxy for memorization. No free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Stack Overflow accepted answers represent reliable ground truth for repository-level questions
    Used to label the 1,318 questions as correct references
  • domain assumption Exact string match to training data indicates lack of genuine reasoning
    Central to the memorization analysis claim

pith-pipeline@v0.9.0 · 5551 in / 1285 out tokens · 22456 ms · 2026-05-14T22:20:20.777947+00:00 · methodology

discussion (0)

