pith. sign in

arxiv: 2606.25402 · v1 · pith:PJVJ5IZYnew · submitted 2026-06-24 · 💻 cs.SE · cs.AI

LibEvoBench: Probing Temporal Knowledge Stratification in Code Generation Models

Pith reviewed 2026-06-25 20:36 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords code generationlibrary evolutiontemporal knowledgeLLMsbenchmarksAPI versionssoftware engineering
0
0 comments X

The pith

Current code generation models cannot distinguish between different versions of evolving libraries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark to test whether large language models keep track of which API version they are supposed to use when generating code for Python libraries that change across releases. It shows that models produce correct code for stable APIs regardless of the target version, but accuracy drops when APIs evolve, indicating the models treat all versions as interchangeable. Prompting with the target version number gives no improvement, yet inserting relevant documentation raises accuracy. Real projects frequently depend on older library releases, so this gap creates a practical problem for using models in maintenance or legacy code tasks. The work frames the issue as a result of training on mixed historical data without mechanisms to separate versions.

Core claim

State-of-the-art models are largely version-oblivious: performance degrades for evolving APIs, while for stable APIs it remains the same across versions. Moreover, simply specifying the target version provides no benefit, while relevant documentation significantly boosts models' accuracy.

What carries the argument

LibEvoBench benchmark spanning multiple versions of Python libraries together with the SEUS metric that scores consistency on version-specific code tasks.

If this is right

  • Models will produce anachronistic API calls when asked to work with older releases of changing libraries.
  • Current training on temporally mixed data leaves no built-in way for models to reason about version differences.
  • Adding documentation to prompts can compensate for the missing version awareness in some cases.
  • New training methods will be required to give models explicit temporal grounding for library knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams maintaining codebases on older library versions may see more errors when using these models for assistance.
  • The benchmark could be applied to measure whether future training runs that include version tags improve results.
  • Similar tests might reveal whether the same version-oblivious behavior appears in other languages or domains.

Load-bearing premise

The benchmark tasks and SEUS metric measure only version-specific knowledge and are not affected by other patterns in the models' training data or by prompt wording.

What would settle it

Run the same models on the benchmark after fine-tuning them on version-labeled documentation and check whether accuracy rises only on the evolving-API tasks while staying flat on stable ones.

Figures

Figures reproduced from arXiv: 2606.25402 by Arie van Deursen, Daniele Cipollone, Egor Bogomolov, Maliheh Izadi, Sergey Titov.

Figure 1
Figure 1. Figure 1: Overview of the three evaluation tasks and the API-C noise reduction levels. API-C uses real code contexts with progressively richer prompts (L0–L3). API-I isolates API identification from a redacted description. SR probes signature recall from memory. Full prompt templates and response extraction details are in Appendix F. 2.2. Data Collection The benchmark construction pipeline is fully automated: starti… view at source ↗
Figure 2
Figure 2. Figure 2: Mean score across API-C@L1, API-I, and SR as a function of normalized library version progression. Stable APIs (dashed) remain flat; Evolving APIs (solid) degrade toward recent versions across all model families. Further breakdown in Appendix B. L0 L1 L2 L3 50 60 70 80 90 100 API-C EM (%) OpenAI GPT Stable / Evolving GPT-4.1 GPT-5 GPT-5.1 GPT-5.4 GPT-5.5 L0 L1 L2 L3 Prompt augmentation level 50 60 70 80 90… view at source ↗
Figure 3
Figure 3. Figure 3: API-C noise reduction staircase aggregated over libraries. Redacted documentation context drives the L0→L1 gain (code→code+doc); adding a version constraint at L2 (code+doc+version) yields no further improvement. L3 (API name+version) shifts the task to parameter prediction. and versions, yet this should not be mistaken for genuine version awareness. When models fail here, they systemat￾ically fall back on… view at source ↗
Figure 4
Figure 4. Figure 4: API-C@L1 exact match on Evolving APIs per PyTorch release. Per-version resolution on real code completion instances across all four model families. 40 60 80 100 API-C@L2 PyTorch 40 60 80 100 NumPy 40 60 80 100 SciPy 40 60 80 100 API-I 40 60 80 100 40 60 80 100 40 60 80 100 API-C@L3 40 60 80 100 40 60 80 100 1.12 1.13 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 40 60 80 100 SR 1.21 1.22 1.23 1.24 1.25 1.26 2.0 2.1 2.2 … view at source ↗
Figure 5
Figure 5. Figure 5: Per-version taxonomy breakdown for GPT-5.1 across API-level (API Call with version API-C@L2, API Identification API-I) and parameter-level (API Call parameter recall API-C@L3, Signature Recall SR) predictions. The blue line reports parameter recall for API-C@L3. below 86%. However, retention is uninformative for less powerful models—Qwen3.5 122B achieves 87.8% retention despite ranking 12th, because both i… view at source ↗
Figure 6
Figure 6. Figure 6: Library co-occurrence graph mined from ∼70k GitHub requirements.txt files. Node size reflects individual support; edge weight reflects joint support. Filters: ≥ 8% individual, ≥ 7% joint. Version selection. For each selected library, we extract every pinned version specifier (== operator) from the same manifest corpus, yielding a real-world frequency distribution over library versions weighted by how often… view at source ↗
Figure 7
Figure 7. Figure 7: Normalized version-usage density aggregated across all libraries mined from GitHub manifests. The horizontal axis maps each library’s release history onto [0, 1]. The same pattern holds for our three target libraries individually [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Per-library version-usage distribution mined from pinned == specifiers in real manifest files. 12 [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Comprehensive performance breakdown across tasks, aggregated across libraries [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Per-version taxonomy breakdown for Sonnet 4 across API-level (API Call with version API-C@L2, API Identification API-I) and parameter-level (API Call parameter recall API-C@L3, Signature Recall SR) predictions. The blue line reports parameter recall for API-C@L3. 40 60 80 100 API-C@L2 PyTorch 40 60 80 100 NumPy 40 60 80 100 SciPy 40 60 80 100 API-I 40 60 80 100 40 60 80 100 40 60 80 100 API-C@L3 40 60 80 … view at source ↗
Figure 11
Figure 11. Figure 11: Per-version taxonomy breakdown for Gemini 2.5 Flash across API-level (API Call with version API-C@L2, API Identification API-I) and parameter-level (API Call parameter recall API-C@L3, Signature Recall SR) predictions. The blue line reports parameter recall for API-C@L3. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Per-version taxonomy breakdown for Qwen3.5 397B across API-level (API Call with version API-C@L2, API Identification API-I) and parameter-level (API Call parameter recall API-C@L3, Signature Recall SR) predictions. The blue line reports parameter recall for API-C@L3 [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
read the original abstract

Large software projects often depend on older versions of libraries, even as APIs continue to evolve across releases. This creates a challenge for LLMs: they must maintain knowledge of multiple API versions, not merely the latest or most common one. However, current LLMs are trained on temporally mixed corpora and lack explicit mechanisms for such version-specific reasoning, leading to anachronistic errors - calling APIs as they exist in a different library version. To systematically evaluate this phenomenon, we introduce LibEvoBench, a multi-task benchmark spanning multiple versions of widely used Python libraries, along with a new metric, the Software Evolution Understanding Score (SEUS), to measure models' consistency when working with evolving APIs. Our results show that state-of-the-art models are largely version-oblivious: performance degrades for evolving APIs, while for stable APIs it remains the same across versions. Moreover, simply specifying the target version provides no benefit, while relevant documentation significantly boosts models' accuracy. These findings highlight a systematic limitation of current training paradigms and motivate new approaches for temporally grounded knowledge in code generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LibEvoBench, a multi-task benchmark covering multiple versions of common Python libraries, and the SEUS metric to quantify models' consistency on evolving APIs. It reports that state-of-the-art code generation models are largely version-oblivious: performance drops on APIs that change across releases but stays constant on stable APIs; simply naming the target version in the prompt yields no improvement, whereas supplying relevant documentation does.

Significance. If the benchmark tasks and SEUS metric are shown to isolate temporal version knowledge without confounding by training-data frequency or prompt leakage, the findings would identify a concrete limitation of current pre-training regimes for code LLMs and supply a reusable evaluation resource for future work on temporally grounded code generation.

major comments (2)
  1. [Abstract / §3] Abstract and presumed §3 (benchmark construction): the central claim that models are 'version-oblivious' rests on the assumption that LibEvoBench tasks and the SEUS metric isolate temporal stratification; without explicit controls for API-version frequency in the pre-training corpus or checks for prompt leakage, observed performance differences could be explained by data imbalance rather than lack of version-specific reasoning.
  2. [Abstract / §4] Abstract and presumed §4 (experiments): the statement that 'simply specifying the target version provides no benefit' is load-bearing for the version-obliviousness conclusion, yet the abstract supplies no description of how the version identifier was inserted into the prompt template or whether the model was given any mechanism to condition on it.
minor comments (2)
  1. The paper should report the exact number of libraries, versions per library, and task templates used in LibEvoBench so that reproducibility and coverage can be assessed.
  2. Clarify the precise formula for SEUS and whether it normalizes for task difficulty across stable vs. evolving APIs.

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for the constructive feedback. We address each major comment below, clarifying our methodology and noting where revisions are appropriate.

read point-by-point responses
  1. Referee: [Abstract / §3] Abstract and presumed §3 (benchmark construction): the central claim that models are 'version-oblivious' rests on the assumption that LibEvoBench tasks and the SEUS metric isolate temporal stratification; without explicit controls for API-version frequency in the pre-training corpus or checks for prompt leakage, observed performance differences could be explained by data imbalance rather than lack of version-specific reasoning.

    Authors: The SEUS metric is constructed to compare performance deltas on evolving APIs versus stable APIs drawn from the same libraries and task templates, which provides an internal control for library-level frequency effects. We performed manual verification that task prompts do not contain verbatim excerpts from public documentation that would constitute leakage. We agree that direct frequency counts from proprietary pre-training corpora are unavailable and will add an explicit limitations paragraph discussing this potential confound. revision: partial

  2. Referee: [Abstract / §4] Abstract and presumed §4 (experiments): the statement that 'simply specifying the target version provides no benefit' is load-bearing for the version-obliviousness conclusion, yet the abstract supplies no description of how the version identifier was inserted into the prompt template or whether the model was given any mechanism to condition on it.

    Authors: Section 4 fully specifies the prompt templates, including the exact phrasing used to insert the target version (a short prefix such as "Target Python version: 3.8"). We will revise the abstract to include a concise description of the version-specification condition so that the claim is self-contained. revision: yes

standing simulated objections not resolved
  • Direct measurement or explicit controls for the frequency of individual API versions within the pre-training corpora of closed-source models, which is not publicly accessible.

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark

full rationale

This is an empirical benchmark paper introducing LibEvoBench and the SEUS metric to evaluate LLMs on version-specific API knowledge. It contains no mathematical derivation chain, no fitted parameters presented as predictions, and no load-bearing self-citations that reduce claims to unverified inputs. The central findings rest on experimental results from constructed tasks, which are independently falsifiable via replication on the benchmark rather than by construction from the paper's own definitions or prior self-citations. No steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described; the work is an empirical benchmark introduction.

pith-pipeline@v0.9.1-grok · 5733 in / 952 out tokens · 13112 ms · 2026-06-25T20:36:46.661145+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 37 canonical work pages · 11 internal anchors

  1. [1]

    Introducing Sonnet 4.6 Anthropic , 2026

    Anthropic . Introducing Sonnet 4.6 Anthropic , 2026. URL https://www.anthropic.com/news/claude-sonnet-4-6

  2. [2]

    v., Izadi, M., and Bryksin, T

    Bogomolov, E., Eliseeva, A., Galimzyanov, T., Glukhov, E., Shapkin, A., Tigina, M., Golubev, Y., Kovrigin, A., Deursen, A. v., Izadi, M., and Bryksin, T. Long Code Arena : a Set of Benchmarks for Long - Context Code Models , June 2024. URL http://arxiv.org/abs/2406.11612. arXiv:2406.11612 [cs]

  3. [3]

    Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware , May 2025

    Chen, Y., Chen, M., Gao, C., Jiang, Z., Li, Z., and Ma, Y. Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware , May 2025. URL http://arxiv.org/abs/2505.05057. arXiv:2505.05057 [cs]

  4. [5]

    Gemini: A Family of Highly Capable Multimodal Models , May 2025

    Gemini, T. Gemini: A Family of Highly Capable Multimodal Models , May 2025. URL http://arxiv.org/abs/2312.11805. arXiv:2312.11805 [cs]

  5. [6]

    K., and Kumar, V

    Jain, N., Kwiatkowski, R., Ray, B., Ramanathan, M. K., and Kumar, V. On Mitigating Code LLM Hallucinations with API Documentation , July 2024. URL http://arxiv.org/abs/2407.09726. arXiv:2407.09726 [cs]

  6. [7]

    U., Wang, Z., Jain, N., Qian, H., Ray, B., Ramanathan, M

    Kuhar, S., Ahmad, W. U., Wang, Z., Jain, N., Qian, H., Ray, B., Ramanathan, M. K., Ma, X., and Deoras, A. LibEvolutionEval : A Benchmark and Study for Version - Specific Code Generation , November 2024. URL http://arxiv.org/abs/2412.04478. arXiv:2412.04478 [cs]

  7. [9]

    Beyond Functional Correctness : Exploring Hallucinations in LLM - Generated Code , January 2026

    Liu, F., Liu, Y., Shi, L., Yang, Z., Zhang, L., Lian, X., Li, Z., and Ma, Y. Beyond Functional Correctness : Exploring Hallucinations in LLM - Generated Code , January 2026. URL http://arxiv.org/abs/2404.00971. arXiv:2404.00971 [cs] version: 3

  8. [10]

    L., Pandit, S., Ye, X., Choi, E., and Durrett, G

    Liu, Z. L., Pandit, S., Ye, X., Choi, E., and Durrett, G. CodeUpdateArena : Benchmarking Knowledge Editing on API Updates . October 2024. URL https://openreview.net/forum?id=ecRyUAPshY

  9. [11]

    Lozhkov, A., Li, R., Allal, L. B., Cassano, F., Lamy-Poirier, J., Tazi, N., Tang, A., Pykhtar, D., Liu, J., Wei, Y., Liu, T., Tian, M., Kocetkov, D., Zucker, A., Belkada, Y., Wang, Z., Liu, Q., Abulkhanov, D., Paul, I., Li, Z., Li, W.-D., Risdal, M., Li, J., Zhu, J., Zhuo, T. Y., Zheltonozhskii, E., Dade, N. O. O., Yu, W., Krauß, L., Jain, N., Su, Y., He,...

  10. [12]

    B., Rish, I., Kahou, S

    Misra, D., Islah, N., May, V., Rauby, B., Wang, Z., Gehring, J., Orvieto, A., Chaudhary, M., Muller, E. B., Rish, I., Kahou, S. E., and Caccia, M. GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities , July 2025. URL http://arxiv.org/abs/2507.12367. arXiv:2507.12367 [cs]

  11. [13]

    Introducing GPT -5.5, April 2026

    OpenAI . Introducing GPT -5.5, April 2026. URL https://openai.com/index/introducing-gpt-5-5/

  12. [14]

    Spracklen, J., Wijewickrama, R., Sakib, A. H. M. N., Maiti, A., Viswanath, B., and Jadliwala, M. We Have a Package for You ! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs , March 2025. URL http://arxiv.org/abs/2406.10279. arXiv:2406.10279 [cs]

  13. [15]

    Qwen3.5- Omni Technical Report , April 2026

    Team, Q. Qwen3.5- Omni Technical Report , April 2026. URL http://arxiv.org/abs/2604.15804. arXiv:2604.15804 [cs]

  14. [16]

    CodeHalu : Investigating Code Hallucinations in LLMs via Execution -based Verification , January 2025

    Tian, Y., Yan, W., Yang, Q., Zhao, X., Chen, Q., Wang, W., Luo, Z., Ma, L., and Song, D. CodeHalu : Investigating Code Hallucinations in LLMs via Execution -based Verification , January 2025. URL http://arxiv.org/abs/2405.00253. arXiv:2405.00253 [cs]

  15. [17]

    LLMs Meet Library Evolution : Evaluating Deprecated API Usage in LLM -based Code Completion , February 2025

    Wang, C., Huang, K., Zhang, J., Feng, Y., Zhang, L., Liu, Y., and Peng, X. LLMs Meet Library Evolution : Evaluating Deprecated API Usage in LLM -based Code Completion , February 2025. URL http://arxiv.org/abs/2406.09834. arXiv:2406.09834 [cs]

  16. [18]

    TIME : A Multi -level Benchmark for Temporal Reasoning of LLMs in Real - World Scenarios , October 2025

    Wei, S., Li, W., Song, F., Luo, W., Zhuang, T., Tan, H., Guo, Z., and Wang, H. TIME : A Multi -level Benchmark for Temporal Reasoning of LLMs in Real - World Scenarios , October 2025. URL http://arxiv.org/abs/2505.12891. arXiv:2505.12891 [cs]

  17. [19]

    VersiCode : Towards Version -controllable Code Generation , October 2024

    Wu, T., Wu, W., Wang, X., Xu, K., Ma, S., Jiang, B., Yang, P., Xing, Z., Li, Y.-F., and Haffari, G. VersiCode : Towards Version -controllable Code Generation , October 2024. URL http://arxiv.org/abs/2406.07411. arXiv:2406.07411 [cs]

  18. [20]

    LLM Hallucinations in Practical Code Generation : Phenomena , Mechanism , and Mitigation , September 2024

    Zhang, Z., Wang, Y., Wang, C., Chen, J., and Zheng, Z. LLM Hallucinations in Practical Code Generation : Phenomena , Mechanism , and Mitigation , September 2024. URL http://arxiv.org/abs/2409.20550. arXiv:2409.20550 [cs] version: 1

  19. [21]

    Zhao, B., Brumbaugh, Z., Wang, Y., Hajishirzi, H., and Smith, N. A. Set the Clock : Temporal Alignment of Pretrained Language Models , June 2024. URL http://arxiv.org/abs/2402.16797. arXiv:2402.16797 [cs]

  20. [23]

    , month = jun, year =

    Zhao, Bowen and Brumbaugh, Zander and Wang, Yizhong and Hajishirzi, Hannaneh and Smith, Noah A. , month = jun, year =. Set the. doi:10.48550/arXiv.2402.16797 , abstract =

  21. [24]

    https://doi.org/10.1162/tacl_a_00459, https://aclanthology.org/2022.tacl-1.15/

    Dhingra, Bhuwan and Cole, Jeremy R. and Eisenschlos, Julian Martin and Gillick, Daniel and Eisenstein, Jacob and Cohen, William W. , editor =. Time-. Transactions of the Association for Computational Linguistics , publisher =. 2022 , pages =. doi:10.1162/tacl_a_00459 , abstract =

  22. [25]

    and Ouni, Ali and Ishio, Takashi and Inoue, Katsuro , year=

    Do. Empirical Software Engineering , author =. 2018 , note =. doi:10.1007/s10664-017-9521-5 , abstract =

  23. [26]

    OpenAI , month = apr, year =

    Introducing. OpenAI , month = apr, year =

  24. [27]

    Team Gemini , month = may, year =. Gemini:. doi:10.48550/arXiv.2312.11805 , abstract =

  25. [28]

    Qwen3.5-Omni Technical Report

    Team, Qwen , month = apr, year =. Qwen3.5-. doi:10.48550/arXiv.2604.15804 , abstract =

  26. [29]

    doi:10.48550/arXiv.2505.12891 , abstract =

    Wei, Shaohang and Li, Wei and Song, Feifan and Luo, Wen and Zhuang, Tianyi and Tan, Haochen and Guo, Zhijiang and Wang, Houfeng , month = oct, year =. doi:10.48550/arXiv.2505.12891 , abstract =

  27. [30]

    Proceedings of the 63rd

    Zhu, Zhiyuan and Liao, Yusheng and Chen, Zhe and Wang, Yuhao and Guan, Yunfeng and Wang, Yanfeng and Wang, Yu , editor =. Proceedings of the 63rd. 2025 , pages =. doi:10.18653/v1/2025.acl-long.788 , abstract =

  28. [31]

    doi:10.48550/arXiv.2406.07411 , abstract =

    Wu, Tongtong and Wu, Weigang and Wang, Xingyu and Xu, Kang and Ma, Suyu and Jiang, Bo and Yang, Ping and Xing, Zhenchang and Li, Yuan-Fang and Haffari, Gholamreza , month = oct, year =. doi:10.48550/arXiv.2406.07411 , abstract =

  29. [32]

    Liu, Zeyu Leo and Pandit, Shrey and Ye, Xi and Choi, Eunsol and Durrett, Greg , month = oct, year =

  30. [33]

    Efficient Training of Language Models to Fill in the Middle

    Bavarian, Mohammad and Jun, Heewoo and Tezak, Nikolas and Schulman, John and McLeavey, Christine and Tworek, Jerry and Chen, Mark , month = jul, year =. Efficient. doi:10.48550/arXiv.2207.14255 , abstract =

  31. [34]

    Bogomolov, Egor and Eliseeva, Aleksandra and Galimzyanov, Timur and Glukhov, Evgeniy and Shapkin, Anton and Tigina, Maria and Golubev, Yaroslav and Kovrigin, Alexander and Deursen, Arie van and Izadi, Maliheh and Bryksin, Timofey , month = jun, year =. Long. doi:10.48550/arXiv.2406.11612 , abstract =

  32. [35]

    Survey of hallucination in natural language generation,

    Survey of. ACM Computing Surveys , author =. 2023 , note =. doi:10.1145/3571730 , abstract =

  33. [36]

    doi:10.48550/arXiv.2406.09834 , abstract =

    Wang, Chong and Huang, Kaifeng and Zhang, Jian and Feng, Yebo and Zhang, Lyuye and Liu, Yang and Peng, Xin , month = feb, year =. doi:10.48550/arXiv.2406.09834 , abstract =

  34. [37]

    2025 , issue_date =

    A. ACM Transactions on Information Systems , author =. 2025 , note =. doi:10.1145/3703155 , abstract =

  35. [38]

    Jain, Nihal and Kwiatkowski, Robert and Ray, Baishakhi and Ramanathan, Murali Krishna and Kumar, Varun , month = jul, year =. On. doi:10.48550/arXiv.2407.09726 , abstract =

  36. [39]

    Spracklen, Joseph and Wijewickrama, Raveen and Sakib, A. H. M. Nazmus and Maiti, Anindya and Viswanath, Bimal and Jadliwala, Murtuza , month = mar, year =. We. doi:10.48550/arXiv.2406.10279 , abstract =

  37. [40]

    doi:10.48550/arXiv.2409.20550 , abstract =

    Zhang, Ziyao and Wang, Yanlin and Wang, Chong and Chen, Jiachi and Zheng, Zibin , month = sep, year =. doi:10.48550/arXiv.2409.20550 , abstract =

  38. [41]

    doi:10.48550/arXiv.2405.00253 , abstract =

    Tian, Yuchen and Yan, Weixiang and Yang, Qian and Zhao, Xuandong and Chen, Qian and Wang, Wen and Luo, Ziyang and Ma, Lei and Song, Dawn , month = jan, year =. doi:10.48550/arXiv.2405.00253 , abstract =

  39. [42]

    Liu, Fang and Liu, Yang and Shi, Lin and Yang, Zhen and Zhang, Li and Lian, Xiaoli and Li, Zhongqi and Ma, Yuchi , month = jan, year =. Beyond. doi:10.48550/arXiv.2404.00971 , abstract =

  40. [43]

    Chen, Yujia and Chen, Mingyu and Gao, Cuiyun and Jiang, Zhihan and Li, Zhongqi and Ma, Yuchi , month = may, year =. Towards. doi:10.48550/arXiv.2505.05057 , abstract =

  41. [44]

    Why Language Models Hallucinate

    Kalai, Adam Tauman and Nachum, Ofir and Vempala, Santosh S. and Zhang, Edwin , month = sep, year =. Why. doi:10.48550/arXiv.2509.04664 , abstract =

  42. [45]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and Küttler, Heinrich and Lewis, Mike and Yih, Wen-tau and Rocktäschel, Tim and Riedel, Sebastian and Kiela, Douwe , month = apr, year =. Retrieval-. doi:10.48550/arXiv.2005.11401 , abstract =

  43. [46]

    arXiv.org , author =

    Retrieval-. arXiv.org , author =

  44. [47]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Gao, Yunfan and Xiong, Yun and Gao, Xinyu and Jia, Kangxiang and Pan, Jinliu and Bi, Yuxi and Dai, Yi and Sun, Jiawei and Wang, Meng and Wang, Haofen , month = mar, year =. Retrieval-. doi:10.48550/arXiv.2312.10997 , abstract =

  45. [48]

    StarCoder 2 and The Stack v2: The Next Generation

    Lozhkov, Anton and Li, Raymond and Allal, Loubna Ben and Cassano, Federico and Lamy-Poirier, Joel and Tazi, Nouamane and Tang, Ao and Pykhtar, Dmytro and Liu, Jiawei and Wei, Yuxiang and Liu, Tianyang and Tian, Max and Kocetkov, Denis and Zucker, Arthur and Belkada, Younes and Wang, Zijian and Liu, Qian and Abulkhanov, Dmitry and Paul, Indraneil and Li, Z...

  46. [49]

    Peng, Sida and Kalliamvakou, Eirini and Cihon, Peter and Demirer, Mert , month = feb, year =. The. doi:10.48550/arXiv.2302.06590 , abstract =

  47. [50]

    Knowledge

    Xu, Rongwu and Qi, Zehan and Guo, Zhijiang and Wang, Cunxiang and Wang, Hongru and Zhang, Yue and Xu, Wei , month = jun, year =. Knowledge. doi:10.48550/arXiv.2403.08319 , abstract =

  48. [51]

    Adaptive

    Xie, Jian and Zhang, Kai and Chen, Jiangjie and Lou, Renze and Su, Yu , month = feb, year =. Adaptive. doi:10.48550/arXiv.2305.13300 , abstract =

  49. [52]

    TreeRanker: Fast and Model-agnostic Ranking System for Code Suggestions in IDEs

    Cipollone, Daniele and Bogomolov, Egor and Deursen, Arie van and Izadi, Maliheh , month = aug, year =. doi:10.48550/arXiv.2508.02455 , abstract =

  50. [53]

    Measuring. Commun. ACM , author =. 2024 , pages =. doi:10.1145/3633453 , abstract =

  51. [54]

    Identifying and

    Zhuo, Terry Yue and He, Junda and Sun, Jiamou and Xing, Zhenchang and Lo, David and Grundy, John and Du, Xiaoning , month = dec, year =. Identifying and. doi:10.48550/arXiv.2503.22821 , abstract =

  52. [55]

    Ashik, Ahmed Nusayer and Wang, Shaowei and Chen, Tse-Hsun and Asaduzzaman, Muhammad and Tian, Yuan , month = apr, year =. When. doi:10.48550/arXiv.2604.09515 , abstract =

  53. [56]

    and Rish, Irina and Kahou, Samira Ebrahimi and Caccia, Massimo , month = jul, year =

    Misra, Diganta and Islah, Nizar and May, Victor and Rauby, Brice and Wang, Zihan and Gehring, Justine and Orvieto, Antonio and Chaudhary, Muawiz and Muller, Eilif B. and Rish, Irina and Kahou, Samira Ebrahimi and Caccia, Massimo , month = jul, year =. doi:10.48550/arXiv.2507.12367 , abstract =

  54. [57]

    TimeMachine-bench: A Benchmark for Evaluating Model Capabilities in Repository-Level Migration Tasks

    Fujii, Ryo and Morishita, Makoto and Yano, Kazuki and Suzuki, Jun , month = jan, year =. doi:10.48550/arXiv.2601.22597 , abstract =

  55. [58]

    Yang, Jian and Liu, Xianglong and Lv, Weifeng and Deng, Ken and Guo, Shawn and Jing, Lin and Li, Yizhi and Liu, Shark and Luo, Xianzhen and Luo, Yuyu and Pan, Changzai and Shi, Ensheng and Tan, Yingshui and Tao, Renshuai and Wu, Jiajun and Wu, Xianjie and Wu, Zhenhe and Zan, Daoguang and Zhang, Chenchen and Zhang, Wei and Zhu, He and Zhuo, Terry Yue and C...

  56. [59]

    Pavlichenko, Nikita and Nazarov, Iurii and Dolgov, Ivan and Garanina, Ekaterina and Ustalov, Dmitry and Bondyrev, Ivan and Lysaniuk, Kseniia and Vu, Evgeniia and Chekmenev, Kirill and Shtok, Joseph and Golubev, Yaroslav and Semenkin, Anton and Sazanovich, Uladzislau , month = oct, year =. Mellum:. doi:10.48550/arXiv.2510.05788 , abstract =

  57. [60]

    Zhang, Quanjun and Fang, Chunrong and Xie, Yang and Zhang, Yaxin and Yang, Yun and Sun, Weisong and Yu, Shengcheng and Chen, Zhenyu , month = sep, year =. A. doi:10.48550/arXiv.2312.15223 , abstract =

  58. [61]

    doi:10.48550/arXiv.2412.04478 , abstract =

    Kuhar, Sachit and Ahmad, Wasi Uddin and Wang, Zijian and Jain, Nihal and Qian, Haifeng and Ray, Baishakhi and Ramanathan, Murali Krishna and Ma, Xiaofei and Deoras, Anoop , month = nov, year =. doi:10.48550/arXiv.2412.04478 , abstract =

  59. [62]

    doi:10.48550/arXiv.2502.16645 , abstract =

    Wang, Chenlong and Chu, Zhaoyang and Cheng, Zhengxiang and Yang, Xuyi and Qiu, Kaiyue and Wan, Yao and Zhao, Zhou and Shi, Xuanhua and Chen, Dongping , month = jun, year =. doi:10.48550/arXiv.2502.16645 , abstract =