pith. machine review for the scientific record.

arxiv: 2604.13997 · v1 · submitted 2026-04-15 · 💻 cs.SE

Recognition: unknown

Learned or Memorized? Quantifying Memorization Advantage in Code LLMs

Djiré Albérick Euraste, Earl T. Barr, Jacques Klein, Jordan Samhi, Kaboré Abdoul Kader, Tegawendé F. Bissyandé

Pith reviewed 2026-05-10 12:33 UTC · model grok-4.3

classification 💻 cs.SE
keywords memorization advantage · code LLMs · perturbation method · data leakage · code generation · vulnerability detection · bug fixing · benchmark evaluation

The pith

A perturbation method quantifies memorization advantage in code LLMs as the performance gap on likely seen versus unseen inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a perturbation-based approach to measure how much code LLMs depend on memorized training examples rather than learned patterns. It applies this to eight open-source models across nineteen benchmarks in code generation, understanding, vulnerability detection, and bug fixing. Results show wide variation: some models display high sensitivity to perturbations on certain tasks while others remain relatively stable, and task types differ markedly, with summarization showing lower memorization advantage than test generation. Two commonly suspected benchmarks, CVEFixes and Defects4J, exhibit low advantage across models, implying greater reliance on generalization than direct recall in those cases. This matters for assessing when models can be trusted on novel code without hidden leakage effects.

Core claim

We present a perturbation-based method to quantify memorization advantage in code LLMs, defined as the performance gap between likely seen and unseen inputs. We evaluate 8 open-source code LLMs on 19 benchmarks across four task families: code generation, code understanding, vulnerability detection, and bug fixing. Sensitivity patterns vary widely across models and tasks. For example, StarCoder reaches high sensitivity on some benchmarks (up to 0.8), while QwenCoder remains lower (mostly below 0.4), suggesting differences in generalization behavior. Task categories also differ: code summarization tends to show low sensitivity, whereas test generation is substantially higher. We then analyze two widely discussed benchmarks, CVEFixes and Defects4J, often suspected of leakage.

What carries the argument

Perturbation-based method that alters code inputs to create likely unseen versions and measures the resulting performance drop as memorization advantage.

If this is right

  • Memorization advantage reaches high levels (up to 0.8) in some models like StarCoder on certain benchmarks but stays low (below 0.4) in others like QwenCoder.
  • Task type strongly influences the advantage, with code summarization showing low sensitivity while test generation shows substantially higher sensitivity.
  • CVEFixes displays memorization advantage below 0.1 and Defects4J lower than other program repair benchmarks, pointing to generalization rather than memorization on these sets.
  • Evaluation protocols for code LLMs need strengthening, especially for security-related tasks where undetected memorization could mask vulnerabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models with consistently low memorization advantage may be preferable for applications involving novel or proprietary code where leakage risks must be minimized.
  • The method could extend to testing whether fine-tuning on synthetic data reduces memorization advantage on downstream benchmarks.
  • Low advantage on security benchmarks may indicate that concerns about training-data leakage in vulnerability detection have been overstated for current open models.
  • Perturbations could be refined to target specific code structures like API calls or control flows to isolate memorization of particular patterns.

Load-bearing premise

The selected perturbations reliably generate inputs the model never saw in training, and any performance gap measures memorization rather than general input sensitivity or task difficulty.

What would settle it

Apply the method to a model trained only on data known to exclude the original benchmark examples and check whether the performance gap on perturbed inputs disappears or remains.

Figures

Figures reproduced from arXiv: 2604.13997 by Djiré Albérick Euraste, Earl T. Barr, Jacques Klein, Jordan Samhi, Kaboré Abdoul Kader, Tegawendé F. Bissyandé.

Figure 1. GPT-4o text completion performance falloff for a memorized Shakespeare poem subjected to perturbations vs. gradual performance decline on a recent BBC text (which could not be part of GPT-4o's training set). view at source ↗
Figure 2. Memorization advantage example: impact of … view at source ↗
Figure 3. Example outputs by StarCoder2 when applying perturbations. view at source ↗
Figure 4. Example outputs by StarCoder when applying perturbations. view at source ↗
Figure 6. Perturbation sensitivity distributions across test generation benchmarks. view at source ↗
Figure 8. Perturbation sensitivity distributions across vulnerability detection benchmarks. view at source ↗
Figure 9. Perturbation sensitivity distributions for code summarization benchmarks. view at source ↗
read the original abstract

The lack of transparency about code datasets used to train large language models (LLMs) makes it difficult to detect, evaluate, and mitigate data leakage. We present a perturbation-based method to quantify memorization advantage in code LLMs, defined as the performance gap between likely seen and unseen inputs. We evaluate 8 open-source code LLMs on 19 benchmarks across four task families: code generation, code understanding, vulnerability detection, and bug fixing. Sensitivity patterns vary widely across models and tasks. For example, StarCoder reaches high sensitivity on some benchmarks (up to 0.8), while QwenCoder remains lower (mostly below 0.4), suggesting differences in generalization behavior. Task categories also differ: code summarization tends to show low sensitivity, whereas test generation is substantially higher. We then analyze two widely discussed benchmarks, CVEFixes and Defects4J, often suspected of leakage. Contrary to common concerns, both show low memorization advantage across models: CVEFixes remains below 0.1, and Defects4J is lower than other program repair benchmarks. These results suggest that, for these datasets, models may rely more on learned generalization than direct memorization. Overall, our findings provide evidence that memorization risk is highly task- and model-dependent, and highlight the need for stronger evaluation protocols, especially in security-focused settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a perturbation-based method to quantify memorization advantage in code LLMs, defined as the performance gap between original (likely seen) and perturbed (likely unseen) inputs. The authors evaluate 8 open-source code LLMs on 19 benchmarks spanning code generation, understanding, vulnerability detection, and bug fixing. They report widely varying sensitivity patterns: StarCoder reaches high sensitivity (up to 0.8) on some benchmarks while QwenCoder stays lower (below 0.4); code summarization shows low sensitivity while test generation is substantially higher; and memorization advantage is notably low on CVEFixes (<0.1) and on Defects4J relative to other repair benchmarks. They conclude that memorization risk is task- and model-dependent and call for stronger evaluation protocols.

Significance. If the perturbation method successfully isolates memorization effects from other factors like input robustness or task sensitivity, this work would offer a valuable empirical framework for detecting data leakage in code LLMs without requiring training data access. The findings challenge assumptions about leakage in security-related benchmarks like CVEFixes and Defects4J, and underscore the need for more robust evaluation practices. The empirical scale (8 models, 19 benchmarks) provides broad coverage, though the lack of detailed controls limits immediate impact.

major comments (3)
  1. [Abstract / Method] Abstract and method description: The central definition of memorization advantage as the performance delta between original and perturbed inputs assumes that (1) the perturbations reliably generate inputs absent from training data and (2) the observed gap is caused by memorization rather than changes in input difficulty, syntactic validity, or general robustness. No perturbation operators, validation that perturbed samples were unseen, or control experiments (e.g., on models trained on disjoint data) are described, leaving the causal attribution unverified and directly affecting the task- and model-dependence claims.
  2. [Results on CVEFixes and Defects4J] CVEFixes and Defects4J analysis: The claim that these benchmarks exhibit low memorization advantage (CVEFixes below 0.1; Defects4J lower than other program repair benchmarks) is used to argue against common leakage concerns. However, without reported statistical significance tests, error bars, or analysis of how perturbation strength affects the gap, it remains unclear whether the low values reflect genuine generalization or insufficiently disruptive perturbations.
  3. [Evaluation setup] Evaluation across task families: The paper contrasts sensitivity across four task families (e.g., low for summarization, high for test generation) but provides no explicit definition or formula for the 'sensitivity' metric in the abstract, nor details on consistent performance measurement (e.g., pass@k for generation vs. F1 for vulnerability detection). Without these details, the cross-task comparisons that carry the model- and task-dependence conclusion cannot be verified.
minor comments (2)
  1. The abstract refers to 'sensitivity patterns' without an immediate inline definition tying it back to the memorization advantage gap; a brief parenthetical or footnote would improve readability.
  2. Consider including a summary table listing the 19 benchmarks, their task families, and key metrics to aid readers in interpreting the variation claims.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback, which helps clarify key aspects of our perturbation-based approach to quantifying memorization advantage. We address each major comment below with clarifications and planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract / Method] Abstract and method description: The central definition of memorization advantage as the performance delta between original and perturbed inputs assumes that (1) the perturbations reliably generate inputs absent from training data and (2) the observed gap is caused by memorization rather than changes in input difficulty, syntactic validity, or general robustness. No perturbation operators, validation that perturbed samples were unseen, or control experiments (e.g., on models trained on disjoint data) are described, leaving the causal attribution unverified and directly affecting the task- and model-dependence claims.

    Authors: We will expand the Method section in the revision to detail the specific perturbation operators (e.g., identifier renaming, statement reordering, and comment insertion/deletion, chosen to preserve semantics and syntactic validity). We agree that direct validation of 'unseen' status is impossible without training data access; this is a fundamental limitation of leakage-detection methods that do not require such access, which is the motivation for our work. For causal attribution, we will add discussion explaining that the gap likely reflects memorization because perturbations target surface-level patterns while keeping task semantics intact; we will also note the absence of disjoint-data controls as a limitation. These additions support rather than undermine the task- and model-dependence findings. A minimal sketch of one such operator appears after this list. revision: partial

  2. Referee: [Results on CVEFixes and Defects4J] CVEFixes and Defects4J analysis: The claim that these benchmarks exhibit low memorization advantage (CVEFixes below 0.1; Defects4J lower than other program repair benchmarks) is used to argue against common leakage concerns. However, without reported statistical significance tests, error bars, or analysis of how perturbation strength affects the gap, it remains unclear whether the low values reflect genuine generalization or insufficiently disruptive perturbations.

    Authors: We agree that statistical rigor would strengthen the claims. In the revised version, we will include error bars on all memorization advantage plots, conduct paired statistical tests (e.g., t-tests) on original vs. perturbed performance, and add an analysis varying perturbation intensity (e.g., single vs. multiple operators) to confirm the low values on CVEFixes (<0.1) and Defects4J are not artifacts of weak perturbations. This will better support the conclusion that these benchmarks show limited memorization advantage. A sketch of the paired-test analysis appears after this list. revision: yes

  3. Referee: [Evaluation setup] Evaluation across task families: The paper contrasts sensitivity across four task families (e.g., low for summarization, high for test generation) but provides no explicit definition or formula for the 'sensitivity' metric in the abstract, nor details on consistent performance measurement (e.g., pass@k for generation vs. F1 for vulnerability detection). This makes cross-task comparisons load-bearing for the model-dependence conclusion.

    Authors: We will update the abstract and add an explicit subsection in Methods defining the sensitivity metric as the normalized performance gap: (P_original - P_perturbed) / max(P_original, epsilon), where P is the task-specific metric. We will also include a table specifying the exact metric for each task family (pass@k for code/test generation, F1 for vulnerability detection, BLEU/ROUGE for summarization) to ensure transparent cross-task comparisons. These changes will make the model- and task-dependence claims more robust. The metric is sketched in code after this list. revision: yes

standing simulated objections not resolved
  • Direct empirical validation that perturbed inputs were absent from training data, or control experiments using models trained on fully disjoint datasets, cannot be performed without access to the original (often proprietary) training corpora.

Circularity Check

0 steps flagged

No circularity: empirical performance-gap measurement is self-contained

full rationale

The paper introduces a perturbation-based method and explicitly defines memorization advantage as the observed performance gap between original and perturbed inputs. It then reports measured gaps across 8 models and 19 benchmarks without any equations, fitted parameters, or derivations that reduce the reported quantities to the definition by construction. No self-citations are invoked as load-bearing premises, no uniqueness theorems are imported, and no ansatz or renaming of known results occurs. The central findings (task- and model-dependent sensitivity, low gaps on CVEFixes/Defects4J) rest on direct empirical comparison rather than tautological re-expression of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that perturbations create likely-unseen inputs whose performance gap measures memorization; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Perturbed code inputs are likely unseen during model training
    This underpins the definition of memorization advantage as the performance gap between seen and unseen inputs.

pith-pipeline@v0.9.0 · 5578 in / 1365 out tokens · 39863 ms · 2026-05-10T12:33:10.348560+00:00 · methodology

discussion (0)

