LibEvoBench: Probing Temporal Knowledge Stratification in Code Generation Models

Arie van Deursen; Daniele Cipollone; Egor Bogomolov; Maliheh Izadi; Sergey Titov

arxiv: 2606.25402 · v1 · pith:PJVJ5IZYnew · submitted 2026-06-24 · 💻 cs.SE · cs.AI

Pith reviewed 2026-06-25 20:36 UTC · model grok-4.3

keywords code generation · library evolution · temporal knowledge · LLMs · benchmarks · API versions · software engineering

The pith

Current code generation models cannot distinguish between different versions of evolving libraries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark to test whether large language models keep track of which API version they are supposed to use when generating code for Python libraries that change across releases. It shows that models produce correct code for stable APIs regardless of the target version, but accuracy drops when APIs evolve, indicating the models treat all versions as interchangeable. Prompting with the target version number gives no improvement, yet inserting relevant documentation raises accuracy. Real projects frequently depend on older library releases, so this gap creates a practical problem for using models in maintenance or legacy code tasks. The work frames the issue as a result of training on mixed historical data without mechanisms to separate versions.

Core claim

State-of-the-art models are largely version-oblivious: performance degrades for evolving APIs, while for stable APIs it remains the same across versions. Moreover, simply specifying the target version provides no benefit, while relevant documentation significantly boosts models' accuracy.

What carries the argument

The LibEvoBench benchmark, which spans multiple versions of Python libraries, and the SEUS metric (Software Evolution Understanding Score), which scores a model's consistency on version-specific code tasks.

If this is right

Models will produce anachronistic API calls when asked to work with older releases of changing libraries.
Current training on temporally mixed data leaves no built-in way for models to reason about version differences.
Adding documentation to prompts can compensate for the missing version awareness in some cases.
New training methods will be required to give models explicit temporal grounding for library knowledge.
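
The first consequence above can be made concrete with a small check. The sketch below is not part of LibEvoBench; it assumes a hand-written table (`API_LIFESPAN`, a hypothetical name) that records the release in which an API was added or removed, and it flags calls that do not exist in the target release. The two versioned entries reflect the NumPy 2.0 namespace cleanup (`numpy.product` removed, `numpy.trapezoid` added); all helper names are illustrative.

```python
from typing import Dict, List, Optional, Tuple

# Illustrative, hand-written table: API -> (added_in, removed_in).
# None means "outside the range of releases considered here".
API_LIFESPAN: Dict[str, Tuple[Optional[str], Optional[str]]] = {
    "numpy.prod": (None, None),        # stable across the releases considered
    "numpy.product": (None, "2.0"),    # removed in NumPy 2.0 (use numpy.prod)
    "numpy.trapezoid": ("2.0", None),  # added in NumPy 2.0
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.26' into (1, 26) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_available(api: str, target: str) -> bool:
    """True if `api` exists in release `target` according to the table."""
    added, removed = API_LIFESPAN[api]
    t = parse_version(target)
    if added is not None and t < parse_version(added):
        return False
    if removed is not None and t >= parse_version(removed):
        return False
    return True


def anachronisms(calls: List[str], target: str) -> List[str]:
    """Return the calls that do not exist in the target release."""
    return [api for api in calls if not is_available(api, target)]
```

Under these assumptions, a completion that uses `numpy.trapezoid` is flagged for a 1.26 target, while one that uses `numpy.product` is flagged for a 2.1 target; `numpy.prod` passes in both cases.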

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Teams maintaining codebases on older library versions may see more errors when using these models for assistance.
The benchmark could be applied to measure whether future training runs that include version tags improve results.
Similar tests might reveal whether the same version-oblivious behavior appears in other languages or domains.

Load-bearing premise

The benchmark tasks and SEUS metric measure only version-specific knowledge and are not affected by other patterns in the models' training data or by prompt wording.

What would settle it

Run the same models on the benchmark after fine-tuning them on version-labeled documentation and check whether accuracy rises only on the evolving-API tasks while staying flat on stable ones.
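
The proposed check amounts to comparing accuracy gains on the two task groups. The following is a minimal sketch under stated assumptions: per-task outcomes are 0/1 correctness flags, and the tolerance `tol` that separates "rises" from "stays flat" is an arbitrary illustrative choice, not a value from the paper. Function names are hypothetical.

```python
from typing import Sequence


def accuracy(outcomes: Sequence[int]) -> float:
    """Fraction of tasks solved; each outcome is 1 (correct) or 0."""
    return sum(outcomes) / len(outcomes)


def gain(before: Sequence[int], after: Sequence[int]) -> float:
    """Accuracy change from the base model to the fine-tuned model."""
    return accuracy(after) - accuracy(before)


def gain_is_selective(
    evolving_before: Sequence[int],
    evolving_after: Sequence[int],
    stable_before: Sequence[int],
    stable_after: Sequence[int],
    tol: float = 0.02,  # assumed threshold, chosen only for illustration
) -> bool:
    """True if accuracy rises on evolving-API tasks and stays flat on stable ones."""
    evolving_gain = gain(evolving_before, evolving_after)
    stable_gain = gain(stable_before, stable_after)
    return evolving_gain > tol and abs(stable_gain) <= tol
```

A real analysis would also need a significance test over many tasks; the threshold comparison here only illustrates the shape of the decision.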

Figures

Figures reproduced from arXiv: 2606.25402 by Arie van Deursen, Daniele Cipollone, Egor Bogomolov, Maliheh Izadi, Sergey Titov.

**Figure 1.** Overview of the three evaluation tasks and the API-C noise reduction levels. API-C uses real code contexts with progressively richer prompts (L0–L3). API-I isolates API identification from a redacted description. SR probes signature recall from memory. Full prompt templates and response extraction details are in Appendix F.

**Figure 2.** Mean score across API-C@L1, API-I, and SR as a function of normalized library version progression. Stable APIs (dashed) remain flat; Evolving APIs (solid) degrade toward recent versions across all model families. Further breakdown in Appendix B.

**Figure 3.** API-C noise reduction staircase aggregated over libraries. Redacted documentation context drives the L0→L1 gain (code→code+doc); adding a version constraint at L2 (code+doc+version) yields no further improvement. L3 (API name+version) shifts the task to parameter prediction.

**Figure 4.** API-C@L1 exact match on Evolving APIs per PyTorch release. Per-version resolution on real code completion instances across all four model families.

**Figure 5.** Per-version taxonomy breakdown for GPT-5.1 across API-level (API Call with version API-C@L2, API Identification API-I) and parameter-level (API Call parameter recall API-C@L3, Signature Recall SR) predictions. The blue line reports parameter recall for API-C@L3.

**Figure 6.** Library co-occurrence graph mined from ∼70k GitHub requirements.txt files. Node size reflects individual support; edge weight reflects joint support. Filters: ≥ 8% individual, ≥ 7% joint.

**Figure 7.** Normalized version-usage density aggregated across all libraries mined from GitHub manifests. The horizontal axis maps each library’s release history onto [0, 1]. The same pattern holds for our three target libraries individually.

**Figure 8.** Per-library version-usage distribution mined from pinned == specifiers in real manifest files.

**Figure 9.** Comprehensive performance breakdown across tasks, aggregated across libraries.

**Figure 10.** Per-version taxonomy breakdown for Sonnet 4 across API-level (API Call with version API-C@L2, API Identification API-I) and parameter-level (API Call parameter recall API-C@L3, Signature Recall SR) predictions. The blue line reports parameter recall for API-C@L3.

**Figure 11.** Per-version taxonomy breakdown for Gemini 2.5 Flash across API-level (API Call with version API-C@L2, API Identification API-I) and parameter-level (API Call parameter recall API-C@L3, Signature Recall SR) predictions. The blue line reports parameter recall for API-C@L3.

**Figure 12.** Per-version taxonomy breakdown for Qwen3.5 397B across API-level (API Call with version API-C@L2, API Identification API-I) and parameter-level (API Call parameter recall API-C@L3, Signature Recall SR) predictions. The blue line reports parameter recall for API-C@L3.
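
The captions for Figures 6 and 8 describe mining requirements.txt manifests for library co-occurrence and for pinned `==` versions. The sketch below shows one way such statistics could be computed; it is not the authors' pipeline. It assumes that "support" means the fraction of manifests containing a library (or a pair of libraries), and the regular expression handles only simple `name==version` lines (no extras, markers, or includes). All function names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Matches simple pinned lines such as "numpy==1.26.4" (assumed subset of the format).
PIN = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([0-9][0-9A-Za-z.]*)")


def parse_manifest(text: str) -> Dict[str, str]:
    """Return {library: pinned_version} for lines that use the == operator."""
    pins: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = PIN.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def support(manifests: List[Dict[str, str]]):
    """Individual and joint support as fractions of all manifests."""
    n = len(manifests)
    single: Counter = Counter()
    joint: Counter = Counter()
    for pins in manifests:
        libs = sorted(pins)
        single.update(libs)
        joint.update(combinations(libs, 2))
    return ({lib: c / n for lib, c in single.items()},
            {pair: c / n for pair, c in joint.items()})


def filter_graph(single, joint, min_single=0.08, min_joint=0.07):
    """Keep nodes and edges that meet the thresholds quoted in the Figure 6 caption."""
    nodes: Set[str] = {lib for lib, s in single.items() if s >= min_single}
    edges: Dict[Tuple[str, str], float] = {
        pair: s for pair, s in joint.items()
        if s >= min_joint and pair[0] in nodes and pair[1] in nodes
    }
    return nodes, edges


def version_distribution(manifests: List[Dict[str, str]], library: str) -> Dict[str, float]:
    """Relative frequency of each pinned version of `library`."""
    counts = Counter(p[library] for p in manifests if library in p)
    total = sum(counts.values())
    return {version: c / total for version, c in counts.items()}
```

Range specifiers such as `scipy>=1.10` are skipped on purpose, since only exact pins identify a single release.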

Original abstract

Large software projects often depend on older versions of libraries, even as APIs continue to evolve across releases. This creates a challenge for LLMs: they must maintain knowledge of multiple API versions, not merely the latest or most common one. However, current LLMs are trained on temporally mixed corpora and lack explicit mechanisms for such version-specific reasoning, leading to anachronistic errors - calling APIs as they exist in a different library version. To systematically evaluate this phenomenon, we introduce LibEvoBench, a multi-task benchmark spanning multiple versions of widely used Python libraries, along with a new metric, the Software Evolution Understanding Score (SEUS), to measure models' consistency when working with evolving APIs. Our results show that state-of-the-art models are largely version-oblivious: performance degrades for evolving APIs, while for stable APIs it remains the same across versions. Moreover, simply specifying the target version provides no benefit, while relevant documentation significantly boosts models' accuracy. These findings highlight a systematic limitation of current training paradigms and motivate new approaches for temporally grounded knowledge in code generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LibEvoBench and SEUS show that code models handle stable APIs well but lose accuracy on evolving ones, and that documentation helps where naming the version does not. Whether the tasks isolate version knowledge still needs checking.

The paper introduces LibEvoBench, a multi-version benchmark across Python libraries, plus the SEUS metric to score consistency on evolving versus stable APIs. The main result is that current models show degraded performance on changing APIs, no improvement from simply stating the target version in prompts, and a clear lift when relevant documentation is added.

This is new in focusing explicitly on temporal stratification rather than single-version code generation. The stable-versus-evolving split gives a straightforward comparison that ties directly to maintenance of long-lived codebases, and the documentation finding lines up with known retrieval effects.

The setup is reasonable on its face for surfacing a practical limitation. The abstract frames the problem clearly and the reported patterns are consistent with how training corpora mix versions without explicit temporal signals.

The soft spot is whether the tasks and the SEUS metric actually isolate version-specific knowledge. Without details on how versions were chosen, how prompts were built, or how API frequency and training-data leakage were controlled, the performance gap could reflect data distribution rather than version obliviousness. The load-bearing premise identified above is exactly this point.

This is for researchers working on code LLMs or SE tools that must support older library releases. A reader interested in benchmark design or temporal robustness would get concrete ideas from it.

It deserves peer review because the problem is real and the benchmark could be reusable if the controls hold up under scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper introduces LibEvoBench, a multi-task benchmark covering multiple versions of common Python libraries, and the SEUS metric to quantify models' consistency on evolving APIs. It reports that state-of-the-art code generation models are largely version-oblivious: performance drops on APIs that change across releases but stays constant on stable APIs; simply naming the target version in the prompt yields no improvement, whereas supplying relevant documentation does.

Significance. If the benchmark tasks and SEUS metric are shown to isolate temporal version knowledge without confounding by training-data frequency or prompt leakage, the findings would identify a concrete limitation of current pre-training regimes for code LLMs and supply a reusable evaluation resource for future work on temporally grounded code generation.

major comments (2)

[Abstract / §3] Abstract and presumed §3 (benchmark construction): the central claim that models are 'version-oblivious' rests on the assumption that LibEvoBench tasks and the SEUS metric isolate temporal stratification; without explicit controls for API-version frequency in the pre-training corpus or checks for prompt leakage, observed performance differences could be explained by data imbalance rather than lack of version-specific reasoning.
[Abstract / §4] Abstract and presumed §4 (experiments): the statement that 'simply specifying the target version provides no benefit' is load-bearing for the version-obliviousness conclusion, yet the abstract supplies no description of how the version identifier was inserted into the prompt template or whether the model was given any mechanism to condition on it.

minor comments (2)

The paper should report the exact number of libraries, versions per library, and task templates used in LibEvoBench so that reproducibility and coverage can be assessed.
Clarify the precise formula for SEUS and whether it normalizes for task difficulty across stable vs. evolving APIs.

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for the constructive feedback. We address each major comment below, clarifying our methodology and noting where revisions are appropriate.

Referee: [Abstract / §3] Abstract and presumed §3 (benchmark construction): the central claim that models are 'version-oblivious' rests on the assumption that LibEvoBench tasks and the SEUS metric isolate temporal stratification; without explicit controls for API-version frequency in the pre-training corpus or checks for prompt leakage, observed performance differences could be explained by data imbalance rather than lack of version-specific reasoning.

Authors: The SEUS metric is constructed to compare performance deltas on evolving APIs versus stable APIs drawn from the same libraries and task templates, which provides an internal control for library-level frequency effects. We performed manual verification that task prompts do not contain verbatim excerpts from public documentation that would constitute leakage. We agree that direct frequency counts from proprietary pre-training corpora are unavailable and will add an explicit limitations paragraph discussing this potential confound. revision: partial
Referee: [Abstract / §4] Abstract and presumed §4 (experiments): the statement that 'simply specifying the target version provides no benefit' is load-bearing for the version-obliviousness conclusion, yet the abstract supplies no description of how the version identifier was inserted into the prompt template or whether the model was given any mechanism to condition on it.

Authors: Section 4 fully specifies the prompt templates, including the exact phrasing used to insert the target version (a short prefix such as "Target Python version: 3.8"). We will revise the abstract to include a concise description of the version-specification condition so that the claim is self-contained. revision: yes
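
To make the version-specification condition concrete, here is a minimal sketch of how prompts at the four augmentation levels named in the Figure 3 caption (L0 code, L1 code+doc, L2 code+doc+version, L3 API name+version) might be assembled. Everything in it is an assumption for illustration: the function name, the template wording, the library-version prefix, and the choice to keep the code context at L3. The paper's actual templates are in its Appendix F and may differ.

```python
LEVELS = ("L0", "L1", "L2", "L3")


def build_prompt(level: str, code: str, doc: str = "", library: str = "",
                 version: str = "", api_name: str = "") -> str:
    """Assemble a hypothetical prompt for one augmentation level."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    parts = []
    if level in ("L2", "L3"):
        # Version constraint, added from L2 onward (wording is illustrative).
        parts.append(f"Target library version: {library} {version}")
    if level in ("L1", "L2"):
        # Redacted documentation context, present at L1 and L2.
        parts.append(f"Documentation (API name redacted):\n{doc}")
    if level == "L3":
        # At L3 the API is given, so only the parameters remain to be predicted.
        parts.append(f"API to call: {api_name}")
    parts.append(f"Complete the masked call:\n{code}")
    return "\n\n".join(parts)
```

Comparing a model's output on the L1 and L2 prompts for the same task isolates the effect of the version line, which is the comparison behind the "no benefit" claim.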

standing simulated objections not resolved

Direct measurement of, or explicit controls for, the frequency of individual API versions in the pre-training corpora of closed-source models, which are not publicly accessible.

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark

This is an empirical benchmark paper introducing LibEvoBench and the SEUS metric to evaluate LLMs on version-specific API knowledge. It contains no mathematical derivation chain, no fitted parameters presented as predictions, and no load-bearing self-citations that reduce claims to unverified inputs. The central findings rest on experimental results from constructed tasks, which are independently falsifiable via replication on the benchmark rather than by construction from the paper's own definitions or prior self-citations. No steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described; the work is an empirical benchmark introduction.

pith-pipeline@v0.9.1-grok · 5733 in / 952 out tokens · 13112 ms · 2026-06-25T20:36:46.661145+00:00 · methodology

Reference graph

Works this paper leans on

59 extracted references · 37 canonical work pages · 11 internal anchors

[1]

Introducing Sonnet 4.6 Anthropic , 2026

Anthropic . Introducing Sonnet 4.6 Anthropic , 2026. URL https://www.anthropic.com/news/claude-sonnet-4-6

2026
[2]

v., Izadi, M., and Bryksin, T

Bogomolov, E., Eliseeva, A., Galimzyanov, T., Glukhov, E., Shapkin, A., Tigina, M., Golubev, Y., Kovrigin, A., Deursen, A. v., Izadi, M., and Bryksin, T. Long Code Arena : a Set of Benchmarks for Long - Context Code Models , June 2024. URL http://arxiv.org/abs/2406.11612. arXiv:2406.11612 [cs]

arXiv 2024
[3]

Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware , May 2025

Chen, Y., Chen, M., Gao, C., Jiang, Z., Li, Z., and Ma, Y. Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware , May 2025. URL http://arxiv.org/abs/2505.05057. arXiv:2505.05057 [cs]

arXiv 2025
[5]

Gemini: A Family of Highly Capable Multimodal Models , May 2025

Gemini, T. Gemini: A Family of Highly Capable Multimodal Models , May 2025. URL http://arxiv.org/abs/2312.11805. arXiv:2312.11805 [cs]

Pith/arXiv arXiv 2025
[6]

K., and Kumar, V

Jain, N., Kwiatkowski, R., Ray, B., Ramanathan, M. K., and Kumar, V. On Mitigating Code LLM Hallucinations with API Documentation , July 2024. URL http://arxiv.org/abs/2407.09726. arXiv:2407.09726 [cs]

arXiv 2024
[7]

U., Wang, Z., Jain, N., Qian, H., Ray, B., Ramanathan, M

Kuhar, S., Ahmad, W. U., Wang, Z., Jain, N., Qian, H., Ray, B., Ramanathan, M. K., Ma, X., and Deoras, A. LibEvolutionEval : A Benchmark and Study for Version - Specific Code Generation , November 2024. URL http://arxiv.org/abs/2412.04478. arXiv:2412.04478 [cs]

arXiv 2024
[9]

Beyond Functional Correctness : Exploring Hallucinations in LLM - Generated Code , January 2026

Liu, F., Liu, Y., Shi, L., Yang, Z., Zhang, L., Lian, X., Li, Z., and Ma, Y. Beyond Functional Correctness : Exploring Hallucinations in LLM - Generated Code , January 2026. URL http://arxiv.org/abs/2404.00971. arXiv:2404.00971 [cs] version: 3

arXiv 2026
[10]

L., Pandit, S., Ye, X., Choi, E., and Durrett, G

Liu, Z. L., Pandit, S., Ye, X., Choi, E., and Durrett, G. CodeUpdateArena : Benchmarking Knowledge Editing on API Updates . October 2024. URL https://openreview.net/forum?id=ecRyUAPshY

2024
[11]

Lozhkov, A., Li, R., Allal, L. B., Cassano, F., Lamy-Poirier, J., Tazi, N., Tang, A., Pykhtar, D., Liu, J., Wei, Y., Liu, T., Tian, M., Kocetkov, D., Zucker, A., Belkada, Y., Wang, Z., Liu, Q., Abulkhanov, D., Paul, I., Li, Z., Li, W.-D., Risdal, M., Li, J., Zhu, J., Zhuo, T. Y., Zheltonozhskii, E., Dade, N. O. O., Yu, W., Krauß, L., Jain, N., Su, Y., He,...

Pith/arXiv arXiv 2024
[12]

B., Rish, I., Kahou, S

Misra, D., Islah, N., May, V., Rauby, B., Wang, Z., Gehring, J., Orvieto, A., Chaudhary, M., Muller, E. B., Rish, I., Kahou, S. E., and Caccia, M. GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities , July 2025. URL http://arxiv.org/abs/2507.12367. arXiv:2507.12367 [cs]

arXiv 2025
[13]

Introducing GPT -5.5, April 2026

OpenAI . Introducing GPT -5.5, April 2026. URL https://openai.com/index/introducing-gpt-5-5/

2026
[14]

Spracklen, J., Wijewickrama, R., Sakib, A. H. M. N., Maiti, A., Viswanath, B., and Jadliwala, M. We Have a Package for You ! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs , March 2025. URL http://arxiv.org/abs/2406.10279. arXiv:2406.10279 [cs]

arXiv 2025
[15]

Qwen3.5- Omni Technical Report , April 2026

Team, Q. Qwen3.5- Omni Technical Report , April 2026. URL http://arxiv.org/abs/2604.15804. arXiv:2604.15804 [cs]

Pith/arXiv arXiv 2026
[16]

CodeHalu : Investigating Code Hallucinations in LLMs via Execution -based Verification , January 2025

Tian, Y., Yan, W., Yang, Q., Zhao, X., Chen, Q., Wang, W., Luo, Z., Ma, L., and Song, D. CodeHalu : Investigating Code Hallucinations in LLMs via Execution -based Verification , January 2025. URL http://arxiv.org/abs/2405.00253. arXiv:2405.00253 [cs]

arXiv 2025
[17]

LLMs Meet Library Evolution : Evaluating Deprecated API Usage in LLM -based Code Completion , February 2025

Wang, C., Huang, K., Zhang, J., Feng, Y., Zhang, L., Liu, Y., and Peng, X. LLMs Meet Library Evolution : Evaluating Deprecated API Usage in LLM -based Code Completion , February 2025. URL http://arxiv.org/abs/2406.09834. arXiv:2406.09834 [cs]

arXiv 2025
[18]

TIME : A Multi -level Benchmark for Temporal Reasoning of LLMs in Real - World Scenarios , October 2025

Wei, S., Li, W., Song, F., Luo, W., Zhuang, T., Tan, H., Guo, Z., and Wang, H. TIME : A Multi -level Benchmark for Temporal Reasoning of LLMs in Real - World Scenarios , October 2025. URL http://arxiv.org/abs/2505.12891. arXiv:2505.12891 [cs]

arXiv 2025
[19]

VersiCode : Towards Version -controllable Code Generation , October 2024

Wu, T., Wu, W., Wang, X., Xu, K., Ma, S., Jiang, B., Yang, P., Xing, Z., Li, Y.-F., and Haffari, G. VersiCode : Towards Version -controllable Code Generation , October 2024. URL http://arxiv.org/abs/2406.07411. arXiv:2406.07411 [cs]

arXiv 2024
[20]

LLM Hallucinations in Practical Code Generation : Phenomena , Mechanism , and Mitigation , September 2024

Zhang, Z., Wang, Y., Wang, C., Chen, J., and Zheng, Z. LLM Hallucinations in Practical Code Generation : Phenomena , Mechanism , and Mitigation , September 2024. URL http://arxiv.org/abs/2409.20550. arXiv:2409.20550 [cs] version: 1

arXiv 2024
[21]

Zhao, B., Brumbaugh, Z., Wang, Y., Hajishirzi, H., and Smith, N. A. Set the Clock : Temporal Alignment of Pretrained Language Models , June 2024. URL http://arxiv.org/abs/2402.16797. arXiv:2402.16797 [cs]

arXiv 2024
[23]

, month = jun, year =

Zhao, Bowen and Brumbaugh, Zander and Wang, Yizhong and Hajishirzi, Hannaneh and Smith, Noah A. , month = jun, year =. Set the. doi:10.48550/arXiv.2402.16797 , abstract =

work page doi:10.48550/arxiv.2402.16797
[24]

https://doi.org/10.1162/tacl_a_00459, https://aclanthology.org/2022.tacl-1.15/

Dhingra, Bhuwan and Cole, Jeremy R. and Eisenschlos, Julian Martin and Gillick, Daniel and Eisenstein, Jacob and Cohen, William W. , editor =. Time-. Transactions of the Association for Computational Linguistics , publisher =. 2022 , pages =. doi:10.1162/tacl_a_00459 , abstract =

work page doi:10.1162/tacl_a_00459 2022
[25]

and Ouni, Ali and Ishio, Takashi and Inoue, Katsuro , year=

Do. Empirical Software Engineering , author =. 2018 , note =. doi:10.1007/s10664-017-9521-5 , abstract =

work page doi:10.1007/s10664-017-9521-5 2018
[26]

OpenAI , month = apr, year =

Introducing. OpenAI , month = apr, year =
[27]

Team Gemini , month = may, year =. Gemini:. doi:10.48550/arXiv.2312.11805 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.11805
[28]

Qwen3.5-Omni Technical Report

Team, Qwen , month = apr, year =. Qwen3.5-. doi:10.48550/arXiv.2604.15804 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.15804
[29]

doi:10.48550/arXiv.2505.12891 , abstract =

Wei, Shaohang and Li, Wei and Song, Feifan and Luo, Wen and Zhuang, Tianyi and Tan, Haochen and Guo, Zhijiang and Wang, Houfeng , month = oct, year =. doi:10.48550/arXiv.2505.12891 , abstract =

work page doi:10.48550/arxiv.2505.12891
[30]

Proceedings of the 63rd

Zhu, Zhiyuan and Liao, Yusheng and Chen, Zhe and Wang, Yuhao and Guan, Yunfeng and Wang, Yanfeng and Wang, Yu , editor =. Proceedings of the 63rd. 2025 , pages =. doi:10.18653/v1/2025.acl-long.788 , abstract =

work page doi:10.18653/v1/2025.acl-long.788 2025
[31]

doi:10.48550/arXiv.2406.07411 , abstract =

Wu, Tongtong and Wu, Weigang and Wang, Xingyu and Xu, Kang and Ma, Suyu and Jiang, Bo and Yang, Ping and Xing, Zhenchang and Li, Yuan-Fang and Haffari, Gholamreza , month = oct, year =. doi:10.48550/arXiv.2406.07411 , abstract =

work page doi:10.48550/arxiv.2406.07411
[32]

Liu, Zeyu Leo and Pandit, Shrey and Ye, Xi and Choi, Eunsol and Durrett, Greg , month = oct, year =
[33]

Efficient Training of Language Models to Fill in the Middle

Bavarian, Mohammad and Jun, Heewoo and Tezak, Nikolas and Schulman, John and McLeavey, Christine and Tworek, Jerry and Chen, Mark , month = jul, year =. Efficient. doi:10.48550/arXiv.2207.14255 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2207.14255
[34]

Bogomolov, Egor and Eliseeva, Aleksandra and Galimzyanov, Timur and Glukhov, Evgeniy and Shapkin, Anton and Tigina, Maria and Golubev, Yaroslav and Kovrigin, Alexander and Deursen, Arie van and Izadi, Maliheh and Bryksin, Timofey , month = jun, year =. Long. doi:10.48550/arXiv.2406.11612 , abstract =

work page doi:10.48550/arxiv.2406.11612
[35]

Survey of hallucination in natural language generation,

Survey of. ACM Computing Surveys , author =. 2023 , note =. doi:10.1145/3571730 , abstract =

work page doi:10.1145/3571730 2023
[36]

doi:10.48550/arXiv.2406.09834 , abstract =

Wang, Chong and Huang, Kaifeng and Zhang, Jian and Feng, Yebo and Zhang, Lyuye and Liu, Yang and Peng, Xin , month = feb, year =. doi:10.48550/arXiv.2406.09834 , abstract =

work page doi:10.48550/arxiv.2406.09834
[37]

2025 , issue_date =

A. ACM Transactions on Information Systems , author =. 2025 , note =. doi:10.1145/3703155 , abstract =

work page doi:10.1145/3703155 2025
[38]

Jain, Nihal and Kwiatkowski, Robert and Ray, Baishakhi and Ramanathan, Murali Krishna and Kumar, Varun , month = jul, year =. On. doi:10.48550/arXiv.2407.09726 , abstract =

work page doi:10.48550/arxiv.2407.09726
[39]

Spracklen, Joseph and Wijewickrama, Raveen and Sakib, A. H. M. Nazmus and Maiti, Anindya and Viswanath, Bimal and Jadliwala, Murtuza , month = mar, year =. We. doi:10.48550/arXiv.2406.10279 , abstract =

work page doi:10.48550/arxiv.2406.10279
[40]

doi:10.48550/arXiv.2409.20550 , abstract =

Zhang, Ziyao and Wang, Yanlin and Wang, Chong and Chen, Jiachi and Zheng, Zibin , month = sep, year =. doi:10.48550/arXiv.2409.20550 , abstract =

work page doi:10.48550/arxiv.2409.20550
[41]

doi:10.48550/arXiv.2405.00253 , abstract =

Tian, Yuchen and Yan, Weixiang and Yang, Qian and Zhao, Xuandong and Chen, Qian and Wang, Wen and Luo, Ziyang and Ma, Lei and Song, Dawn , month = jan, year =. doi:10.48550/arXiv.2405.00253 , abstract =

work page doi:10.48550/arxiv.2405.00253
[42]

Liu, Fang and Liu, Yang and Shi, Lin and Yang, Zhen and Zhang, Li and Lian, Xiaoli and Li, Zhongqi and Ma, Yuchi , month = jan, year =. Beyond. doi:10.48550/arXiv.2404.00971 , abstract =

work page doi:10.48550/arxiv.2404.00971
[43]

Chen, Yujia and Chen, Mingyu and Gao, Cuiyun and Jiang, Zhihan and Li, Zhongqi and Ma, Yuchi , month = may, year =. Towards. doi:10.48550/arXiv.2505.05057 , abstract =

work page doi:10.48550/arxiv.2505.05057
[44]

Why Language Models Hallucinate

Kalai, Adam Tauman and Nachum, Ofir and Vempala, Santosh S. and Zhang, Edwin , month = sep, year =. Why. doi:10.48550/arXiv.2509.04664 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.04664
[45]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and Küttler, Heinrich and Lewis, Mike and Yih, Wen-tau and Rocktäschel, Tim and Riedel, Sebastian and Kiela, Douwe , month = apr, year =. Retrieval-. doi:10.48550/arXiv.2005.11401 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2005.11401 2005
[46]

arXiv.org , author =

Retrieval-. arXiv.org , author =
[47]

Retrieval-Augmented Generation for Large Language Models: A Survey

Gao, Yunfan and Xiong, Yun and Gao, Xinyu and Jia, Kangxiang and Pan, Jinliu and Bi, Yuxi and Dai, Yi and Sun, Jiawei and Wang, Meng and Wang, Haofen , month = mar, year =. Retrieval-. doi:10.48550/arXiv.2312.10997 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.10997
[48]

StarCoder 2 and The Stack v2: The Next Generation

Lozhkov, Anton and Li, Raymond and Allal, Loubna Ben and Cassano, Federico and Lamy-Poirier, Joel and Tazi, Nouamane and Tang, Ao and Pykhtar, Dmytro and Liu, Jiawei and Wei, Yuxiang and Liu, Tianyang and Tian, Max and Kocetkov, Denis and Zucker, Arthur and Belkada, Younes and Wang, Zijian and Liu, Qian and Abulkhanov, Dmitry and Paul, Indraneil and Li, Z...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.19173
[49]

Peng, Sida and Kalliamvakou, Eirini and Cihon, Peter and Demirer, Mert , month = feb, year =. The. doi:10.48550/arXiv.2302.06590 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.06590
[50]

Knowledge

Xu, Rongwu and Qi, Zehan and Guo, Zhijiang and Wang, Cunxiang and Wang, Hongru and Zhang, Yue and Xu, Wei , month = jun, year =. Knowledge. doi:10.48550/arXiv.2403.08319 , abstract =

work page doi:10.48550/arxiv.2403.08319
[51]

Adaptive

Xie, Jian and Zhang, Kai and Chen, Jiangjie and Lou, Renze and Su, Yu , month = feb, year =. Adaptive. doi:10.48550/arXiv.2305.13300 , abstract =

work page doi:10.48550/arxiv.2305.13300
[52]

TreeRanker: Fast and Model-agnostic Ranking System for Code Suggestions in IDEs

Cipollone, Daniele and Bogomolov, Egor and Deursen, Arie van and Izadi, Maliheh , month = aug, year =. doi:10.48550/arXiv.2508.02455 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.02455
[53]

Measuring. Commun. ACM , author =. 2024 , pages =. doi:10.1145/3633453 , abstract =

work page doi:10.1145/3633453 2024
[54]

Identifying and

Zhuo, Terry Yue and He, Junda and Sun, Jiamou and Xing, Zhenchang and Lo, David and Grundy, John and Du, Xiaoning , month = dec, year =. Identifying and. doi:10.48550/arXiv.2503.22821 , abstract =

work page doi:10.48550/arxiv.2503.22821
[55]

Ashik, Ahmed Nusayer and Wang, Shaowei and Chen, Tse-Hsun and Asaduzzaman, Muhammad and Tian, Yuan , month = apr, year =. When. doi:10.48550/arXiv.2604.09515 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.09515
[56]

and Rish, Irina and Kahou, Samira Ebrahimi and Caccia, Massimo , month = jul, year =

Misra, Diganta and Islah, Nizar and May, Victor and Rauby, Brice and Wang, Zihan and Gehring, Justine and Orvieto, Antonio and Chaudhary, Muawiz and Muller, Eilif B. and Rish, Irina and Kahou, Samira Ebrahimi and Caccia, Massimo , month = jul, year =. doi:10.48550/arXiv.2507.12367 , abstract =

work page doi:10.48550/arxiv.2507.12367
[57]

TimeMachine-bench: A Benchmark for Evaluating Model Capabilities in Repository-Level Migration Tasks

Fujii, Ryo and Morishita, Makoto and Yano, Kazuki and Suzuki, Jun , month = jan, year =. doi:10.48550/arXiv.2601.22597 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.22597
[58]

Yang, J., Liu, X., Lv, W., Deng, K., Guo, S., Jing, L., Li, Y., Liu, S., Luo, X., Luo, Y., Pan, C., Shi, E., Tan, Y., Tao, R., Wu, J., Wu, X., Wu, Z., Zan, D., Zhang, C., Zhang, W., Zhu, H., Zhuo, T. Y., and C... doi:10.48550/arXiv.2511.18538
[59]

Pavlichenko, N., Nazarov, I., Dolgov, I., Garanina, E., Ustalov, D., Bondyrev, I., Lysaniuk, K., Vu, E., Chekmenev, K., Shtok, J., Golubev, Y., Semenkin, A., and Sazanovich, U. Mellum: ... doi:10.48550/arXiv.2510.05788
[60]

Zhang, Q., Fang, C., Xie, Y., Zhang, Y., Yang, Y., Sun, W., Yu, S., and Chen, Z. A ... doi:10.48550/arXiv.2312.15223
[61]

Kuhar, S., Ahmad, W. U., Wang, Z., Jain, N., Qian, H., Ray, B., Ramanathan, M. K., Ma, X., and Deoras, A. LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation, November 2024. doi:10.48550/arXiv.2412.04478
[62]

Wang, C., Chu, Z., Cheng, Z., Yang, X., Qiu, K., Wan, Y., Zhao, Z., Shi, X., and Chen, D. doi:10.48550/arXiv.2502.16645

[1] [1]

Anthropic. Introducing Sonnet 4.6. Anthropic, 2026. URL https://www.anthropic.com/news/claude-sonnet-4-6

[2] [2]

Bogomolov, E., Eliseeva, A., Galimzyanov, T., Glukhov, E., Shapkin, A., Tigina, M., Golubev, Y., Kovrigin, A., Deursen, A. v., Izadi, M., and Bryksin, T. Long Code Arena: a Set of Benchmarks for Long-Context Code Models, June 2024. URL http://arxiv.org/abs/2406.11612. arXiv:2406.11612 [cs]

[3] [3]

Chen, Y., Chen, M., Gao, C., Jiang, Z., Li, Z., and Ma, Y. Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware, May 2025. URL http://arxiv.org/abs/2505.05057. arXiv:2505.05057 [cs]

[4] [5]

Gemini Team. Gemini: A Family of Highly Capable Multimodal Models, May 2025. URL http://arxiv.org/abs/2312.11805. arXiv:2312.11805 [cs]

[5] [6]

Jain, N., Kwiatkowski, R., Ray, B., Ramanathan, M. K., and Kumar, V. On Mitigating Code LLM Hallucinations with API Documentation, July 2024. URL http://arxiv.org/abs/2407.09726. arXiv:2407.09726 [cs]

[6] [7]

Kuhar, S., Ahmad, W. U., Wang, Z., Jain, N., Qian, H., Ray, B., Ramanathan, M. K., Ma, X., and Deoras, A. LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation, November 2024. URL http://arxiv.org/abs/2412.04478. arXiv:2412.04478 [cs]

[7] [9]

Liu, F., Liu, Y., Shi, L., Yang, Z., Zhang, L., Lian, X., Li, Z., and Ma, Y. Beyond Functional Correctness: Exploring Hallucinations in LLM-Generated Code, January 2026. URL http://arxiv.org/abs/2404.00971. arXiv:2404.00971 [cs] version: 3

[8] [10]

Liu, Z. L., Pandit, S., Ye, X., Choi, E., and Durrett, G. CodeUpdateArena: Benchmarking Knowledge Editing on API Updates, October 2024. URL https://openreview.net/forum?id=ecRyUAPshY

[9] [11]

Lozhkov, A., Li, R., Allal, L. B., Cassano, F., Lamy-Poirier, J., Tazi, N., Tang, A., Pykhtar, D., Liu, J., Wei, Y., Liu, T., Tian, M., Kocetkov, D., Zucker, A., Belkada, Y., Wang, Z., Liu, Q., Abulkhanov, D., Paul, I., Li, Z., Li, W.-D., Risdal, M., Li, J., Zhu, J., Zhuo, T. Y., Zheltonozhskii, E., Dade, N. O. O., Yu, W., Krauß, L., Jain, N., Su, Y., He,...

[10] [12]

Misra, D., Islah, N., May, V., Rauby, B., Wang, Z., Gehring, J., Orvieto, A., Chaudhary, M., Muller, E. B., Rish, I., Kahou, S. E., and Caccia, M. GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities, July 2025. URL http://arxiv.org/abs/2507.12367. arXiv:2507.12367 [cs]

[11] [13]

OpenAI. Introducing GPT-5.5, April 2026. URL https://openai.com/index/introducing-gpt-5-5/

[12] [14]

Spracklen, J., Wijewickrama, R., Sakib, A. H. M. N., Maiti, A., Viswanath, B., and Jadliwala, M. We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs, March 2025. URL http://arxiv.org/abs/2406.10279. arXiv:2406.10279 [cs]

[13] [15]

Qwen Team. Qwen3.5-Omni Technical Report, April 2026. URL http://arxiv.org/abs/2604.15804. arXiv:2604.15804 [cs]

[14] [16]

Tian, Y., Yan, W., Yang, Q., Zhao, X., Chen, Q., Wang, W., Luo, Z., Ma, L., and Song, D. CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification, January 2025. URL http://arxiv.org/abs/2405.00253. arXiv:2405.00253 [cs]

[15] [17]

Wang, C., Huang, K., Zhang, J., Feng, Y., Zhang, L., Liu, Y., and Peng, X. LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion, February 2025. URL http://arxiv.org/abs/2406.09834. arXiv:2406.09834 [cs]

[16] [18]

Wei, S., Li, W., Song, F., Luo, W., Zhuang, T., Tan, H., Guo, Z., and Wang, H. TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios, October 2025. URL http://arxiv.org/abs/2505.12891. arXiv:2505.12891 [cs]

[17] [19]

Wu, T., Wu, W., Wang, X., Xu, K., Ma, S., Jiang, B., Yang, P., Xing, Z., Li, Y.-F., and Haffari, G. VersiCode: Towards Version-controllable Code Generation, October 2024. URL http://arxiv.org/abs/2406.07411. arXiv:2406.07411 [cs]

[18] [20]

Zhang, Z., Wang, Y., Wang, C., Chen, J., and Zheng, Z. LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation, September 2024. URL http://arxiv.org/abs/2409.20550. arXiv:2409.20550 [cs] version: 1

[19] [21]

Zhao, B., Brumbaugh, Z., Wang, Y., Hajishirzi, H., and Smith, N. A. Set the Clock: Temporal Alignment of Pretrained Language Models, June 2024. URL http://arxiv.org/abs/2402.16797. arXiv:2402.16797 [cs]

[21] [24]

Dhingra, B., Cole, J. R., Eisenschlos, J. M., Gillick, D., Eisenstein, J., and Cohen, W. W. Time-... Transactions of the Association for Computational Linguistics, 2022. doi:10.1162/tacl_a_00459. URL https://aclanthology.org/2022.tacl-1.15/

[22] [25]

... Ouni, A., Ishio, T., and Inoue, K. Do ... Empirical Software Engineering, 2018. doi:10.1007/s10664-017-9521-5

[27] [30]

Zhu, Z., Liao, Y., Chen, Z., Wang, Y., Guan, Y., Wang, Y., and Wang, Y. In Proceedings of the 63rd ..., 2025. doi:10.18653/v1/2025.acl-long.788

[30] [33]

Bavarian, M., Jun, H., Tezak, N., Schulman, J., McLeavey, C., Tworek, J., and Chen, M. Efficient Training of Language Models to Fill in the Middle. doi:10.48550/arXiv.2207.14255

[32] [35]

Survey of hallucination in natural language generation. ACM Computing Surveys, 2023. doi:10.1145/3571730

[34] [37]

A ... ACM Transactions on Information Systems, 2025. doi:10.1145/3703155

[41] [44]

Kalai, A. T., Nachum, O., Vempala, S. S., and Zhang, E. Why Language Models Hallucinate. doi:10.48550/arXiv.2509.04664

[42] [45]

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., and Kiela, D. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. doi:10.48550/arXiv.2005.11401

[43] [46]

Retrieval-... arXiv.org

[44] [47]

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., and Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. doi:10.48550/arXiv.2312.10997

[45] [48]

Lozhkov, A., Li, R., Allal, L. B., Cassano, F., Lamy-Poirier, J., Tazi, N., Tang, A., Pykhtar, D., Liu, J., Wei, Y., Liu, T., Tian, M., Kocetkov, D., Zucker, A., Belkada, Y., Wang, Z., Liu, Q., Abulkhanov, D., Paul, I., Li, Z... StarCoder 2 and The Stack v2: The Next Generation. doi:10.48550/arXiv.2402.19173

[46] [49]

Peng, S., Kalliamvakou, E., Cihon, P., and Demirer, M. The ... doi:10.48550/arXiv.2302.06590

[47] [50]

Xu, R., Qi, Z., Guo, Z., Wang, C., Wang, H., Zhang, Y., and Xu, W. Knowledge ... doi:10.48550/arXiv.2403.08319

[48] [51]

Xie, J., Zhang, K., Chen, J., Lou, R., and Su, Y. Adaptive ... doi:10.48550/arXiv.2305.13300
