
arxiv: 2605.22981 · v1 · pith:LYSKY53Jnew · submitted 2026-05-21 · 💻 cs.CL · cs.AI · cs.LG

Memorization Dynamics of Fill-in-the-Middle Pretraining

Pith reviewed 2026-05-25 05:43 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords fill-in-the-middle · FIM · memorization · pretraining · verbatim extraction · prefix context · language models · LTR

The pith

Fill-in-the-middle pretraining produces linear growth in verbatim memorization with repeated data and keeps recall anchored in prefix context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares fill-in-the-middle (FIM) and left-to-right (LTR) pretraining on a controlled corpus of repeated text excerpts. It shows that verbatim extraction rates under FIM increase roughly in line with the number of repetitions. Prefix-based probes recover more short or partial spans with FIM than with LTR, while native FIM probes demonstrate that suffix context alone does not support strong exact recall. The results indicate that memorization behavior depends on both training objective and probe format. Single-format or single-length evaluations can therefore miss important patterns.

Core claim

In matched Llama 3.2 models trained on FineWeb-Gutenberg excerpts containing artificial repetitions, verbatim extraction under FIM grows approximately linearly with repetition count. Prefix probes lead FIM models to favor shorter or partially matching spans more often than LTR models, which more frequently assign high probability to long exact continuations. When probed in native FIM format, verbatim recall remains strongly dependent on prefix context even when suffix context is supplied.

What carries the argument

Verbatim extraction rate measured across repetition counts using prefix-based probes versus native FIM-format probes.
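
To make the two probe families concrete, here is a minimal sketch of how a prefix-only probe and a native FIM probe might be assembled from a tokenized excerpt. The sentinel strings, the prefix-suffix-middle ordering, and the function names `prefix_probe` and `fim_probe` are assumptions for illustration; this summary does not state the paper's exact sentinels or layout. The 100-token prefix and 32-token target defaults follow the values quoted in the figure captions.

```python
from typing import List, Tuple

# Hypothetical sentinel strings; the tokens actually used in the paper may differ.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def prefix_probe(tokens: List[str], target_start: int,
                 prefix_len: int = 100, target_len: int = 32
                 ) -> Tuple[List[str], List[str]]:
    """Return (prompt, target) for a left-to-right prefix probe."""
    target_end = target_start + target_len
    if target_start < prefix_len or target_end > len(tokens):
        raise ValueError("window does not fit inside the excerpt")
    prompt = tokens[target_start - prefix_len:target_start]
    return prompt, tokens[target_start:target_end]


def fim_probe(tokens: List[str], target_start: int,
              prefix_len: int, suffix_len: int, target_len: int = 32
              ) -> Tuple[List[str], List[str]]:
    """Return (prompt, target) in an assumed prefix-suffix-middle layout."""
    target_end = target_start + target_len
    if target_start < prefix_len or target_end + suffix_len > len(tokens):
        raise ValueError("window does not fit inside the excerpt")
    prefix = tokens[target_start - prefix_len:target_start]
    suffix = tokens[target_end:target_end + suffix_len]
    # The model is asked to generate the middle after the final sentinel.
    prompt = [FIM_PREFIX, *prefix, FIM_SUFFIX, *suffix, FIM_MIDDLE]
    return prompt, tokens[target_start:target_end]


if __name__ == "__main__":
    excerpt = [f"t{i}" for i in range(300)]  # stand-in for real token ids
    ltr_prompt, target = prefix_probe(excerpt, target_start=100)
    # Split a 100-token context budget as 60 prefix / 40 suffix tokens.
    fim_prompt, _ = fim_probe(excerpt, target_start=100, prefix_len=60, suffix_len=40)
    print(len(ltr_prompt), len(fim_prompt), len(target))
```

Varying `prefix_len` and `suffix_len` under a fixed total budget is one way to reproduce the kind of prefix/suffix sweep that the native FIM figures describe.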

If this is right

  • Verbatim extraction under FIM training scales linearly with repetitions over the tested range.
  • Suffix context alone is insufficient to produce strong verbatim recall in FIM-trained models.
  • LTR models more readily produce long exact continuations than FIM models under prefix probing.
  • Limiting evaluation to one span length or one probe format can overlook differences in memorization behavior.
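
One way to examine the "approximately linear" claim in the first bullet is to fit a straight line to extraction rate against repetition count and look at the fit quality. The sketch below uses ordinary least squares in plain Python. The repetition counts and rates in the usage section are invented placeholders, not results from the paper, and `fit_line` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit y ~ slope * x + intercept; returns (slope, intercept, r_squared)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    # R^2 near 1 means a straight line explains most of the variation.
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Placeholder values for illustration only; not taken from the paper.
    repetitions = [1, 2, 4, 8, 16, 32, 64, 128]
    extraction_rate = [0.001 * r for r in repetitions]
    slope, intercept, r2 = fit_line(repetitions, extraction_rate)
    print(f"slope={slope:.5f} intercept={intercept:.5f} R^2={r2:.3f}")
```

A high R² on the raw repetition axis, compared against the same fit on a log-scaled axis, would help distinguish linear growth from saturating or superlinear alternatives.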

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training pipelines that use FIM may require separate monitoring of prefix-anchored recall to assess privacy or copyright exposure.
  • Standard benchmark suites that rely on single probe formats could systematically understate or overstate memorization in infilling models.
  • If the linear trend holds at scale, repetition counts in pretraining data become a direct control knob for expected verbatim leakage.

Load-bearing premise

Memorization patterns observed with artificially repeated excerpts in a controlled corpus match the patterns that would appear during large-scale training on naturally occurring data.

What would settle it

A direct measurement on a large naturally occurring corpus showing either clearly nonlinear growth in verbatim extraction or strong suffix-only recall under FIM training would falsify the reported dynamics.

Figures

Figures reproduced from arXiv: 2605.22981 by Tanguy Dieudonné, Tobias von Arx.

Figure 1: Memorization across repetition buckets. For strict full-span extraction, LTR is higher in aggregate, but FIM extracts more windows at the largest repetition bucket. FIM yields stronger high-overlap recovery for high repetitions. FineWeb is the baseline trained only on FineWeb. Shaded bands denote nominal 95% confidence intervals for the per-window rate. For the exact extraction criterion, LTR overall memo…
Figure 2: Extraction survival curves at repetition 128 show that FIM assigns more mass to moderately likely targets, but LTR has the heavier high-confidence tail. Each line gives the percentage of evaluated target windows with pz ≥ t as the extraction threshold t varies. The 95% confidence intervals are smaller than the line width.
Figure 3: Extraction rates under varying target lengths show that the number of repetitions required for FIM to overtake LTR increases with span length, because longer spans favor LTR's heavier tail. Curves show the fraction of windows with pz ≥ 0.1% for the first 20, 30, 40, and 50 target tokens; all panels use the same y-axis scale. Shaded bands denote nominal 95% confidence intervals for the per-window rate. In line with H…
Figure 4: Target-token top-k support under native FIM geometry at 128 repetitions shows that memorization improves monotonically as more of the 100-token context budget is allocated to the prefix rather than the suffix. The x-axis varies prefix/suffix lengths. The line shows the percentage of target tokens included in top-40 support. The 95% confidence intervals are smaller than the line width.
Figure 5: Attention allocation under native FIM probing shows that the model uses both surrounding contexts, with more attention on the prefix than the suffix, and shifts attention toward earlier target tokens when little prefix is available. The stacked areas show mean attention mass assigned to prefix tokens, suffix tokens, FIM sentinels, and earlier target tokens within the target span, averaged over target-token…
Figure 7: Mean ROUGE-L under prefix probing for 1B and 3B models, evaluated on 10 uniformly sampled windows per excerpt (left) and on the first window of each excerpt (right). Each prompt uses 100 prefix tokens to generate a 32-token continuation. Filled circles denote 3B models; hollow squares denote 1B models. The large gap between first-window and uniformly sampled-window probing indicates that recall is anchored…
Figure 8: Native FIM geometry by repetition bucket. Heatmaps separate the prefix–suffix effect across repetition levels. The x-axis varies prefix/suffix lengths.
Figure 9: Native FIM probing across prefix–suffix geometry. Metrics are over all repetition buckets. The x-axis varies prefix/suffix lengths. Shaded bands are nominal 95% confidence intervals.
Figure 10: Window that was extracted by both models. Numbers indicate the top-k re-normalized logits of the displayed true target tokens. Repetition 128; source book 54068-0; excerpt 54068-0::window 0000; target start 100; prefix length 100 tokens; target length 32 tokens; pz values: LTR=0.711046, FIM=0.585069.
Figure 11: Window that was only extracted by the FIM model. Numbers indicate the top-k re-normalized logits of the displayed true target tokens. Repetition 128; source book 57335-0; excerpt 57335-0::window 0002; target start 100; prefix length 100 tokens; target length 32 tokens; pz values: LTR=0.000219776, FIM=0.204912.
Figure 12: Window that was only extracted by the LTR model. Numbers indicate the top-k re-normalized logits of the displayed true target tokens. Repetition 128; source book 11326-8; excerpt 11326-8::window 0003; target start 100; prefix length 100 tokens; target length 32 tokens; pz values: LTR=0.588202, FIM=0.00063823.
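
The captions for Figures 2 and 3 describe extraction in terms of a per-window probability pz compared against a threshold t. The sketch below shows one way such a survival curve could be computed. It assumes pz is the product of the per-token probabilities of the true target tokens, which this summary does not confirm; the function names and the toy log-probabilities are hypothetical.

```python
import math
from typing import List, Sequence


def sequence_prob(token_logprobs: Sequence[float], n_tokens: int) -> float:
    """Probability of the first n_tokens target tokens.

    Assumes pz is the product of per-token probabilities (sum of log-probs).
    """
    if not 0 < n_tokens <= len(token_logprobs):
        raise ValueError("n_tokens must be between 1 and the target length")
    return math.exp(sum(token_logprobs[:n_tokens]))


def survival_curve(pz_values: Sequence[float],
                   thresholds: Sequence[float]) -> List[float]:
    """Percentage of windows with pz >= t, for each threshold t."""
    if not pz_values:
        raise ValueError("need at least one window")
    n = len(pz_values)
    return [100.0 * sum(p >= t for p in pz_values) / n for t in thresholds]


if __name__ == "__main__":
    # Toy per-token log-probabilities for two 32-token windows (placeholders).
    windows = [[math.log(0.9)] * 32, [math.log(0.5)] * 32]
    for span in (20, 30):
        pz = [sequence_prob(w, span) for w in windows]
        # 0.001 corresponds to the 0.1% threshold quoted in the Figure 3 caption.
        print(span, survival_curve(pz, thresholds=[0.001, 0.1]))
```

Because pz is a product over tokens, it can only shrink as the span grows, which is why the choice of span length and threshold together changes which model appears to memorize more.
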
Original abstract

Fill-in-the-middle (FIM) is a pretraining objective widely used to equip causal language models with infilling ability, yet its effect on verbatim memorization remains underexplored. We study the memorization dynamics of FIM in a controlled setting by pretraining matched Llama 3.2 models with FIM and standard left-to-right (LTR) objectives on a FineWeb-Gutenberg corpus containing repeated Gutenberg excerpts. With prefix-based probes, FIM more often recovers short or partially matching spans, while LTR more often assigns high confidence to long exact continuations. We observe that verbatim extraction under FIM-training grows approximately linearly with repetitions over the tested range. Evaluating native FIM-format probes reveals that suffix context is not sufficient: verbatim recall under FIM-training remains strongly anchored in prefix context. Our results also show that evaluating only one span length or probing format can miss important nuances in memorization behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper examines the memorization dynamics of fill-in-the-middle (FIM) pretraining compared to left-to-right (LTR) pretraining using matched Llama 3.2 models on a FineWeb-Gutenberg corpus with repeated Gutenberg excerpts. Key findings include that verbatim extraction under FIM grows approximately linearly with repetitions, FIM models more often recover short or partially matching spans while LTR assigns high confidence to long exact continuations, and that verbatim recall remains strongly anchored in prefix context even when using native FIM-format probes. The work also notes that single span length or probe format evaluations can miss nuances.

Significance. If the results hold, the controlled isolation of repetition count provides a useful measurement of how FIM objectives affect verbatim recall patterns, including the linear growth observation and prefix dominance. This contributes empirical data relevant to training objective design and potential memorization risks. The use of matched models and native FIM probes is a strength for isolating effects.

major comments (2)
  1. [§3] §3 (Corpus and Experimental Setup): The central claims of linear growth in verbatim extraction and persistent prefix anchoring rest on a corpus constructed by inserting repeated Gutenberg excerpts into FineWeb. This artificial distribution lacks the variable co-occurrence, partial overlaps, and long-tail frequencies of natural pretraining data, raising a correctness risk for generalizing the reported dynamics beyond the controlled setting.
  2. [Results] Results (linear growth and probe evaluations): The manuscript reports directional findings on span recovery and context anchoring but, consistent with the abstract, provides limited quantitative details on model sizes, exact repetition counts tested, number of runs, or statistical measures such as error bars or fit quality. This weakens assessment of the robustness of the 'approximately linear' claim and the sufficiency of suffix context.
minor comments (1)
  1. [Abstract] Abstract: Including at least one concrete quantitative example (e.g., repetition range or span length) would improve clarity without altering the directional claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying the intent of our controlled experimental design and committing to specific improvements in the revised manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Corpus and Experimental Setup): The central claims of linear growth in verbatim extraction and persistent prefix anchoring rest on a corpus constructed by inserting repeated Gutenberg excerpts into FineWeb. This artificial distribution lacks the variable co-occurrence, partial overlaps, and long-tail frequencies of natural pretraining data, raising a correctness risk for generalizing the reported dynamics beyond the controlled setting.

    Authors: We agree that the FineWeb-Gutenberg corpus is deliberately constructed to isolate repetition count as the independent variable. This controlled setup is the core methodological contribution, enabling direct measurement of how FIM versus LTR objectives affect verbatim extraction under matched conditions. The manuscript does not claim that the precise linear growth rates or prefix-anchoring strengths will hold identically in fully naturalistic pretraining distributions; rather, it provides empirical data on objective-specific memorization dynamics that can inform training design. We will revise §3 to explicitly articulate this scope limitation and add a brief discussion of how the observed patterns might interact with natural data statistics. revision: partial

  2. Referee: [Results] Results (linear growth and probe evaluations): The manuscript reports directional findings on span recovery and context anchoring but, consistent with the abstract, provides limited quantitative details on model sizes, exact repetition counts tested, number of runs, or statistical measures such as error bars or fit quality. This weakens assessment of the robustness of the 'approximately linear' claim and the sufficiency of suffix context.

    Authors: We acknowledge that the current presentation would benefit from greater quantitative transparency. In the revision we will expand the Results section (and associated figures/tables) to report: the precise Llama 3.2 model sizes used, the exact repetition counts evaluated, the number of independent training runs, and statistical measures including error bars on extraction rates together with goodness-of-fit metrics for the linear trend. These additions will allow readers to better evaluate the robustness of the linear-growth observation and the prefix-dominance findings under native FIM probes. revision: yes
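
The error bars on extraction rates mentioned in this response could take the form of a confidence interval on each per-window rate. A minimal sketch follows, assuming windows are treated as independent Bernoulli trials and using the Wilson score interval. The figure captions mention "nominal 95% confidence intervals", but this summary does not say which interval the authors use, and the counts in the usage section are placeholders.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Placeholder counts: 37 extracted windows out of 1000 evaluated.
    low, high = wilson_interval(37, 1000)
    print(f"rate=0.037, 95% CI=({low:.4f}, {high:.4f})")
```

The Wilson interval behaves better than the plain normal approximation when rates are close to zero, which is the regime low-repetition buckets are likely to occupy.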

Circularity Check

0 steps flagged

No significant circularity in empirical measurement study

Full rationale

The paper is an empirical study reporting direct measurements of memorization behavior (e.g., verbatim extraction rates under FIM vs. LTR on a controlled corpus) with no derivations, equations, or fitted parameters that define the reported quantities in terms of themselves. The observed linear growth with repetitions is a measured outcome rather than a self-referential construct. No self-citation chains, ansatzes, or uniqueness theorems are invoked to support central claims. The work is self-contained against external benchmarks as a controlled experiment.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical measurement study; no mathematical derivations, fitted constants, or new postulated entities are introduced. The only background assumptions are standard ones about language model training and the representativeness of the constructed corpus.


