Towards Knowledge Alignment in Code LLMs: Contrastive Unlearning for Evolving APIs

Anh H. D. Nguyen; Anh M. T. Bui; Anh N. H. Vu; Dang H. Vu; Huy Q. Tran; Phuong T. Nguyen; Tuyen N. Dinh

arxiv: 2606.30810 · v1 · pith:XFX5NIVBnew · submitted 2026-06-29 · 💻 cs.SE

Pith reviewed 2026-07-01 01:51 UTC · model grok-4.3

classification 💻 cs.SE

keywords: code generation · large language models · deprecated APIs · machine unlearning · contrastive learning · API migration · software evolution

The pith

CURE uses contrastive unlearning to steer code LLMs away from deprecated APIs toward valid replacements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CURE to address how code-generating LLMs produce outdated API calls due to fixed training data and fast-changing libraries. Pure suppression methods reduce old usages but often leave models generating mismatched or incomplete code. CURE instead applies contrastive signals that simultaneously discourage deprecated APIs and promote correct alternatives. Experiments on a recent benchmark show this dual approach lowers deprecated usage, raises correct replacement rates, and leaves general code generation intact. It also beats two existing baselines across quality measures.

Core claim

CURE is a contrastive unlearning method that jointly discourages deprecated APIs while encouraging their valid alternatives, enabling more reliable adaptation to evolving software libraries than methods that only suppress outdated knowledge.

What carries the argument

CURE, the contrastive unlearning approach that shifts from pure suppression of outdated knowledge to explicitly promoting correct API replacements.
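
The abstract does not state CURE's loss, so the following is only a minimal sketch of what a joint "discourage the old call, encourage the new one" objective could look like. It borrows the DPO-style preference loss as a stand-in and is not the paper's formulation. Every name here (`logp_replacement`, `logp_deprecated`, the `ref_` terms, `beta`) is an illustrative assumption.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def preference_style_loss(
    logp_replacement: float,
    logp_deprecated: float,
    ref_logp_replacement: float,
    ref_logp_deprecated: float,
    beta: float = 0.1,
) -> float:
    """Generic DPO-style loss, used here as a stand-in (NOT CURE's actual loss).

    Inputs are sequence log-probabilities under the model being updated and
    under a frozen reference copy. The loss equals -log(sigmoid(beta * margin)),
    so it falls when the replacement completion gains probability relative to
    the deprecated one.
    """
    margin = (logp_replacement - ref_logp_replacement) - (
        logp_deprecated - ref_logp_deprecated
    )
    return softplus(-beta * margin)
```

A suppression-only objective would use only the deprecated term. The extra replacement term is what gives the model somewhere to move probability mass to, which is the distinction the core claim draws.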

If this is right

Reduces deprecated API usage in generated code.
Increases the frequency of correct API replacements.
Preserves general code generation performance on tasks unrelated to the updated APIs.
Outperforms two state-of-the-art baselines on multiple quality metrics for the adaptation task.
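
The first two outcomes above are rates over generated completions. A minimal sketch of how such a behavior distribution might be tallied is shown below, assuming three categories (deprecated, replacement, other) and plain substring matching. Both the categories and the matching rule are assumptions, since the abstract does not define its metrics, and `lib.old_call` / `lib.new_call` in the usage note are hypothetical API names.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (generated code, deprecated call, replacement call); all illustrative.
Sample = Tuple[str, str, str]


def classify(code: str, deprecated: str, replacement: str) -> str:
    """Label one completion; the deprecated call takes priority if both appear."""
    if deprecated in code:
        return "deprecated"
    if replacement in code:
        return "replacement"
    return "other"


def behavior_distribution(samples: Iterable[Sample]) -> Dict[str, float]:
    """Return the percentage of completions falling in each category."""
    labels = [classify(*sample) for sample in samples]
    if not labels:
        raise ValueError("no samples to score")
    counts = Counter(labels)
    return {
        key: 100.0 * counts[key] / len(labels)
        for key in ("deprecated", "replacement", "other")
    }
```

For example, scoring four completions of which one calls `lib.old_call`, two call `lib.new_call`, and one calls neither yields 25% deprecated, 50% replacement, and 25% other. Substring matching is crude; a real evaluation would more likely resolve calls through the syntax tree.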

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Contrastive unlearning could extend to other targeted knowledge updates in LLMs, such as correcting factual errors or domain-specific terminology.
The dual suppression-plus-replacement pattern may lower the frequency of full retraining needed when libraries evolve.
Testing CURE on larger models or additional languages would reveal whether the observed steering effect scales beyond the current benchmark.

Load-bearing premise

The paper assumes that contrastive signals can reliably steer models toward correct replacements without introducing new mismatches or degrading unrelated capabilities. This assumption is tested only on the referenced benchmark dataset.

What would settle it

If, on a new collection of deprecated APIs not seen in the original benchmark, CURE produces more incomplete or erroneous generations than a pure-suppression baseline, the steering benefit would be falsified.
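
The test described above needs an evaluation set whose deprecated APIs never appear during unlearning. One way to build such a split is sketched below, assuming each record carries an `"api"` field naming its deprecated API. The field name and the 20% default are assumptions, not details from the paper.

```python
from typing import Dict, List, Tuple

# Each record must contain an "api" key naming the deprecated API it targets.
Record = Dict[str, str]


def split_by_api(
    records: List[Record], held_out_fraction: float = 0.2
) -> Tuple[List[Record], List[Record]]:
    """Split records so that held-out APIs never appear in the seen set."""
    apis = sorted({record["api"] for record in records})
    n_held_out = max(1, round(len(apis) * held_out_fraction))
    held_out = set(apis[-n_held_out:])
    seen = [r for r in records if r["api"] not in held_out]
    unseen = [r for r in records if r["api"] in held_out]
    return seen, unseen
```

Splitting by API rather than by example matters here: a random example-level split would let the same deprecated call appear on both sides and would overstate how well the steering generalizes.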

Figures

Figures reproduced from arXiv: 2606.30810 by Anh H. D. Nguyen, Anh M. T. Bui, Anh N. H. Vu, Dang H. Vu, Huy Q. Tran, Phuong T. Nguyen, Tuyen N. Dinh.

**Figure 1.** Generation Behavior Distribution on Df across Unlearning Methods.

**Figure 2.** Generation Behavior Distribution on Up-to-date Contexts.

Original abstract

Large Language Models (LLMs) have recently achieved strong performance in code generation. However, due to knowledge cut-off and the rapid evolution of software libraries, they often generate deprecated API usages that lead to unreliable and incompatible code. Existing fine-tuning methods lack selectivity when only a small portion of model knowledge requires modification. Recent model-level approaches, such as machine unlearning and model editing, offer a promising direction for modifying parametric knowledge. However, their use for deprecated API mitigation remains largely unexplored. Moreover, existing methods primarily suppress outdated APIs, but do not explicitly steer models toward correct replacements, often leading to mismatched or incomplete generations. To address this limitation, we developed CURE, a contrastive unlearning approach that shifts unlearning from purely suppressing outdated knowledge to explicitly promoting correct API replacements. Concretely, CURE jointly discourages deprecated APIs while encouraging their valid alternatives, enabling more reliable adaptation to evolving software libraries. Experiments on a recent deprecated API benchmark dataset show that CURE not only reduces deprecated API usage but also improves correct API replacement, while preserving general code generation performance. CURE substantially outperforms two SOTA baselines with respect to different quality metrics. These findings highlight the importance of combining suppression with replacement when adapting LLMs to evolving software ecosystems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CURE extends unlearning by adding explicit contrastive promotion of replacement APIs, but the abstract supplies zero experimental details so the performance claims cannot be checked.

The main takeaway is that this paper introduces CURE, a contrastive unlearning method meant to handle deprecated APIs in code LLMs by discouraging old calls while actively steering toward valid replacements. That is a clear step beyond the suppression-only baselines it cites.

The work identifies a practical pain point: LLMs trained on older data keep emitting outdated library calls, and pure unlearning can leave the model guessing at replacements. Adding a positive contrastive signal is a logical response to that gap, and the abstract positions the method as preserving general code generation while improving replacement accuracy on their benchmark.

The soft spots are straightforward. The abstract asserts outperformance over two SOTA baselines and no loss in general performance, yet it contains no loss formulation, no description of how positive and negative pairs are built, no metrics, no controls, and no results on auxiliary suites such as HumanEval or MBPP. A single deprecated-API benchmark is too narrow to support the claim that the contrastive signal produces reliable steering without side effects. Without those details the central result stays unverified.

This paper is aimed at researchers and practitioners working on reliable code generation in production settings where libraries evolve. Readers who care about model editing for software engineering tasks could extract the core idea even if the experiments need strengthening.

I would send it to peer review. The problem is real, the contrastive framing is new relative to the cited unlearning work, and a full manuscript with proper ablations and multi-benchmark results could be worth referee time. Based on the abstract alone the evidence is too thin to cite or to bring to a reading group.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes CURE, a contrastive unlearning method for code LLMs that jointly discourages deprecated API usages and encourages valid replacements to address knowledge cutoffs in evolving software libraries. On a deprecated API benchmark dataset, the authors claim CURE reduces deprecated API usage, improves correct API replacement rates, preserves general code generation performance, and substantially outperforms two SOTA baselines across multiple quality metrics.

Significance. If the empirical claims hold with proper controls and auxiliary benchmarks, the work would be significant for practical maintenance of code LLMs in dynamic environments. The shift from pure suppression to contrastive promotion of replacements is a targeted idea that could influence knowledge alignment techniques more broadly. The emphasis on selectivity when only a small portion of knowledge needs updating is a useful framing, though the single-benchmark scope limits immediate impact.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: The central claims of outperformance, improved correct replacements, and preservation of general performance are asserted without any reported metrics, tables, statistical tests, dataset details, or controls. This is load-bearing because the headline result cannot be evaluated from the text.
[Method / Experiments] Method and Experiments sections: No description is given of how positive/negative pairs are constructed for the contrastive signals, the exact form of the contrastive loss, or any auxiliary benchmarks (e.g., HumanEval, MBPP) used to verify that unrelated capabilities remain unchanged. Without these, the assumption that joint discouragement and encouragement produces reliable steering without new mismatches cannot be assessed.
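
To make the second comment concrete, here is a purely hypothetical illustration of the kind of pair construction the report asks the authors to document. It is not the manuscript's procedure. The record fields (`prompt`, `completion`, `deprecated_api`), the `migrations` mapping, and the naive string substitution are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ContrastivePair:
    prompt: str
    positive: str  # completion that uses the replacement API
    negative: str  # completion that uses the deprecated API


def build_pairs(
    examples: List[Dict[str, str]], migrations: Dict[str, str]
) -> List[ContrastivePair]:
    """Pair each deprecated completion with a rewritten replacement version.

    Naive substitution ignores signature changes between the old and new
    call, which is exactly the sort of detail a full method section would
    need to address.
    """
    pairs = []
    for example in examples:
        old_call = example["deprecated_api"]
        new_call = migrations.get(old_call)
        if new_call is None or old_call not in example["completion"]:
            continue  # no known replacement, or the call is absent
        pairs.append(
            ContrastivePair(
                prompt=example["prompt"],
                positive=example["completion"].replace(old_call, new_call),
                negative=example["completion"],
            )
        )
    return pairs
```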

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on clarity and completeness. We address each point below and will revise the manuscript to incorporate the requested details and metrics.

Referee: [Abstract / Experiments] Abstract and Experiments section: The central claims of outperformance, improved correct replacements, and preservation of general performance are asserted without any reported metrics, tables, statistical tests, dataset details, or controls. This is load-bearing because the headline result cannot be evaluated from the text.

Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised version, we will update the abstract to report key metrics from the experiments (e.g., deprecated API usage reduction rates, correct replacement improvements, and general performance scores), reference the relevant tables, and note dataset details and any statistical tests performed. The Experiments section will be expanded to ensure all controls and auxiliary results are clearly presented.

Revision: yes

Referee: [Method / Experiments] Method and Experiments sections: No description is given of how positive/negative pairs are constructed for the contrastive signals, the exact form of the contrastive loss, or any auxiliary benchmarks (e.g., HumanEval, MBPP) used to verify that unrelated capabilities remain unchanged. Without these, the assumption that joint discouragement and encouragement produces reliable steering without new mismatches cannot be assessed.

Authors: We will revise the Method section to explicitly describe the construction of positive and negative pairs for the contrastive signals, provide the precise mathematical form of the contrastive loss, and detail the auxiliary benchmarks (including HumanEval and MBPP) along with results showing preservation of unrelated capabilities. This will allow readers to evaluate the selectivity of the updates.

Revision: yes
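
HumanEval results are commonly reported as pass@k, using the unbiased estimator introduced with that benchmark in "Evaluating Large Language Models Trained on Code". A short sketch of that estimator is given below for context. Whether the manuscript reports this metric is not stated, so treat its use here as an assumption about how "preservation of unrelated capabilities" might be quantified.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass the unit tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k), the probability that at least one
    of k samples drawn without replacement is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Comparing the benchmark-wide mean of this score before and after unlearning is one way to check that unrelated capabilities are unchanged.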

Circularity Check

0 steps flagged

No circularity detected; conceptual method proposal with external benchmark evaluation

The paper proposes CURE as a contrastive unlearning technique at a descriptive level, with no equations, derivations, or mathematical chains present in the abstract or described method. Claims rest on experimental results from a referenced benchmark dataset rather than any self-referential fitting, self-definition of terms, or load-bearing self-citations that reduce the result to its inputs by construction. No steps match the enumerated circularity patterns, and the approach is presented as an independent combination of suppression and promotion signals.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available. It describes no explicit free parameters or invented entities, and the single axiom listed below is an implicit domain assumption rather than a stated one.

axioms (1)

Domain assumption: Parametric knowledge in LLMs can be selectively modified through unlearning techniques without full retraining. This is the implicit foundation for applying machine unlearning to API deprecation.

pith-pipeline@v0.9.1-grok · 5787 in / 1092 out tokens · 37607 ms · 2026-07-01T01:51:06.624332+00:00 · methodology

