How Training Data Shapes the Use of Parametric and In-Context Knowledge in Language Models

Dong-Kyum Kim; Jea Kwon; Kyomin Jung; Meeyoung Cha; Minsung Kim; Nakyeong Yang

arxiv: 2510.02370 · v3 · submitted 2025-09-29 · 💻 cs.CL · cs.AI

Minsung Kim, Dong-Kyum Kim, Jea Kwon, Nakyeong Yang, Kyomin Jung, Meeyoung Cha

Pith reviewed 2026-05-18 13:06 UTC · model grok-4.3

keywords: language models, parametric knowledge, in-context knowledge, knowledge arbitration, training data, synthetic corpora, pretraining

The pith

Balanced arbitration between parametric and in-context knowledge emerges only when training data combines intra-document repetition, moderate inconsistency, and skewed distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how pretraining data properties influence whether language models draw on their learned parametric knowledge or defer to information given in the prompt when the two conflict. Using controlled synthetic corpora, the experiments isolate three data traits that together produce reliable arbitration: repeated statements inside documents, a moderate amount of conflicting facts within documents, and uneven coverage of different facts across the corpus. These traits, usually minimized during data cleaning, turn out to be jointly necessary for models to show high-confidence use of parametric knowledge on familiar items and context use on less familiar ones. The authors further observe the same pattern in actual pretraining runs and track how later alignment steps modify the resulting behavior.

Core claim

When parametric and in-context knowledge conflict, models prefer parametric knowledge for high-confidence facts while deferring to context for less familiar ones. Controlled experiments with synthetic corpora show that this balanced arbitration arises as an emergent property precisely when the training data contains intra-document repetition, a moderate degree of intra-document inconsistency, and a skewed knowledge distribution. These three conditions, often viewed as detrimental, must occur together. The same dynamics appear in real-world language-model pretraining, and post-training procedures can reshape the arbitration strategies that result.

What carries the argument

Synthetic corpora that systematically vary intra-document repetition, intra-document inconsistency, and knowledge-distribution skew to measure resulting changes in model arbitration between parametric and in-context knowledge.
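
A minimal sketch of how such a corpus could be generated, to make the three knobs concrete. It is not the authors' pipeline. The four attribute names come from the dataset description quoted further down this page, and the noise probability and Zipf exponent mirror the quantities named in the figure captions. The sentence template, the function names, and the choice to corrupt each non-final mention independently are assumptions made for illustration.

```python
import random

# Attribute names taken from the dataset description on this page.
ATTRIBUTES = ("birth date", "birth city", "university", "major")


def zipf_weights(n_entities, alpha):
    """Unnormalized Zipfian weights: rank r (1-based) gets weight r ** -alpha."""
    return [rank ** -alpha for rank in range(1, n_entities + 1)]


def make_document(name, profile, alternatives, repeats, noise_p, rng):
    """Render one biography with repetition and optional inconsistency noise.

    Each attribute is stated `repeats` times (intra-document repetition).
    Each mention except the last is independently replaced, with probability
    `noise_p`, by a randomly sampled wrong value; the last mention always
    keeps the original value. The template is a hypothetical stand-in.
    """
    sentences = []
    for attr in ATTRIBUTES:
        true_value = profile[attr]
        for i in range(repeats):
            value = true_value
            is_last = i == repeats - 1
            if not is_last and rng.random() < noise_p:
                wrong = [v for v in alternatives[attr] if v != true_value]
                value = rng.choice(wrong)
            sentences.append(f"{name}'s {attr} is {value}.")
    return " ".join(sentences)


def build_corpus(profiles, alternatives, n_docs, alpha, repeats, noise_p, seed=0):
    """Sample entities with Zipfian skew and render one document per draw.

    `profiles` maps a name to its attribute dict; insertion order is used as
    the frequency rank, so the first name is the most frequent entity.
    """
    rng = random.Random(seed)
    names = list(profiles)
    weights = zipf_weights(len(names), alpha)
    sampled = rng.choices(names, weights=weights, k=n_docs)
    return [
        make_document(n, profiles[n], alternatives, repeats, noise_p, rng)
        for n in sampled
    ]
```

Setting `repeats=1` removes repetition, `noise_p=0.0` removes inconsistency, and `alpha=0.0` makes the entity distribution uniform, so each factor can be switched off on its own.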

If this is right

Only the joint presence of repetition, moderate inconsistency, and skew produces robust balanced arbitration.
The same three data properties occur naturally during standard language-model pretraining.
Post-training procedures can shift the balance toward greater reliance on either parametric or in-context knowledge.
Training-data design should deliberately preserve moderate levels of these three properties to support reliable knowledge integration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Data pipelines that aggressively remove repetition and inconsistency may unintentionally weaken a model's ability to decide when to trust its own knowledge.
Controlled injection of moderate inconsistency during pretraining could be tested as a lightweight way to improve context utilization without adding parameters.
The finding suggests a possible link between data skew and reduced hallucination rates on high-confidence facts.
Scaling the same synthetic-corpus design to larger models would test whether the three-factor requirement holds beyond the studied regime.

Load-bearing premise

The specific data properties isolated in the synthetic-corpora experiments are the causal drivers of knowledge-arbitration behavior in large-scale real-world pretraining rather than artifacts of the controlled setup.

What would settle it

Train otherwise identical models on synthetic corpora that each omit one of the three factors and check whether balanced arbitration between parametric and in-context knowledge disappears in every case that lacks the full combination.
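
The test described above amounts to a small ablation grid. The sketch below enumerates the on/off combinations of the three factors and checks a set of measured outcomes against the joint-necessity claim. The factor labels and function names are illustrative, and the outcomes must come from actual training runs; nothing here encodes the paper's results.

```python
from itertools import product

# Illustrative labels for the three data properties discussed on this page.
FACTORS = ("repetition", "inconsistency", "skew")


def ablation_grid():
    """Enumerate all 2**3 on/off combinations of the three data properties."""
    return [
        dict(zip(FACTORS, flags))
        for flags in product((False, True), repeat=len(FACTORS))
    ]


def leave_one_out(grid):
    """Select the configurations that drop exactly one factor."""
    return [cfg for cfg in grid if sum(cfg.values()) == len(FACTORS) - 1]


def claim_holds(balanced_by_config):
    """Check the joint-necessity claim against measured outcomes.

    `balanced_by_config` maps a tuple of flags (in FACTORS order) to a bool
    saying whether balanced arbitration was observed for that run. The claim
    holds if it is observed for the full combination and for no other run.
    """
    full = tuple(True for _ in FACTORS)
    return all(
        observed == (flags == full)
        for flags, observed in balanced_by_config.items()
    )
```

The three leave-one-out configurations are the minimum needed to probe necessity; running the full grid additionally shows whether any pair or single factor suffices.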

Figures

Figures reproduced from arXiv: 2510.02370 by Dong-Kyum Kim, Jea Kwon, Kyomin Jung, Meeyoung Cha, Minsung Kim, Nakyeong Yang.

**Figure 1.** Three knowledge utilization scenarios. Left: parametric knowledge utilization, where the model recalls knowledge encoded in its parameters and answers queries about entities seen during training. Middle: in-context knowledge utilization, where the model extracts and uses knowledge provided only in the prompt and is evaluated on novel entities not seen during training. Right: knowledge conflict resolution whe…

**Figure 2.** An example of intra-document repetition of key attributes (e.g., German, Physics) for …

**Figure 3.** Accuracy of parametric knowledge utilization (Acc …

**Figure 4.** (a) Training dynamics of AccICKU, AccPKU, PrefICK, and PrefPK when trained on the REPEATED+MIX corpus without noise (left) and with 1% noise (right). When the training corpus contains no noise (i.e., no inconsistent knowledge within the same documents), the model consistently prefers in-context knowledge in knowledge conflicts, whereas even a small amount of noise induces a phase shift toward parametric …

**Figure 5.** AccPKU, PrefICK, and PrefPK for the top 10% (high-frequency) and bottom 10% (low-frequency) entities in the training corpus. For high-frequency entities, PrefICK is initially higher but gradually yields to PrefPK; for low-frequency entities, PrefICK remains consistently higher. … p ∈ {1%, 5%, 10%}, replacing them with randomly sampled alternatives, and leave the later paragraph unchanged ( …

**Figure 6.** Bar plots of PrefPK under knowledge conflict (red) and mean entropy in the parametric-knowledge-utilization setting (blue). Bins are ordered by Zipfian rank, where lower rank denotes higher frequency. Left: results with the Zipfian training corpus without inconsistency noise. Right: results with the Zipfian training corpus with a small amount (1%) of inconsistency noise. We further evaluated preference measures acro…

**Figure 7.** AccICKU, AccPKU, PrefICK, and PrefPK in Pythia checkpoints. As shown in …

**Figure 8.** An example of the synthetic dataset. Each profile consists of four attributes ( …

**Figure 9.** Example of the document injected inconsistency noise

**Figure 10.** Training dynamics of AccICKU and AccPKU under different numbers of training entities

**Figure 11.** Training dynamics of AccICKU and AccPKU under different levels of intra-document inconsistency noise

**Figure 12.** Training dynamics of AccICKU and AccPKU as a function of the Zipf exponent α

**Figure 13.** Three knowledge utilization scenarios in real-world large language models.
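
The captions above use PrefPK and PrefICK for behavior under knowledge conflict and report mean entropy by frequency bin, but this page does not reproduce their definitions. One plausible reading, offered as an assumption: each preference is the fraction of conflict queries whose answer matches the parametric value or the in-context value, and entropy is the Shannon entropy of the model's answer distribution. The function names below are hypothetical.

```python
import math


def preference_rates(predictions, parametric_answers, context_answers):
    """Return (pref_pk, pref_ick) over a set of conflict queries.

    pref_pk is the fraction of predictions equal to the value learned in
    training; pref_ick is the fraction equal to the conflicting value given
    in the prompt. Predictions matching neither count toward neither rate.
    """
    n = len(predictions)
    if n == 0:
        raise ValueError("need at least one conflict query")
    pk = sum(p == a for p, a in zip(predictions, parametric_answers))
    ick = sum(p == c for p, c in zip(predictions, context_answers))
    return pk / n, ick / n


def entropy(probs):
    """Shannon entropy in nats of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Under this reading the two rates need not sum to one, since a model can produce an answer that matches neither source.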

Original abstract

Large language models leverage both parametric knowledge acquired during pretraining and in-context knowledge provided at inference time. Crucially, when these sources conflict, models arbitrate based on their internal confidence, preferring parametric knowledge for high-confidence facts while deferring to context for less familiar ones. However, the training conditions that give rise to these fundamental behaviors remain unclear. Here we conduct controlled experiments using synthetic corpora to identify the specific data properties that shape knowledge utilization. Our results reveal a counterintuitive finding: the robust, balanced use of both knowledge sources is an emergent property that requires the co-occurrence of three factors typically considered detrimental, including (i) intra-document repetition, (ii) a moderate degree of intra-document inconsistency, and (iii) a skewed knowledge distribution. We further show that these dynamics arise in real-world language model pretraining and analyze how post-training procedures reshape arbitration strategies. Together, our findings provide empirical guidance for designing training data that supports the reliable integration of parametric and in-context knowledge in language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper identifies a specific trio of data properties that together produce balanced parametric and in-context knowledge use, with controlled synthetics doing the heavy lifting but real-world links staying observational.

Colleague, the main takeaway is that balanced arbitration between what a model knows from training and what it sees in the prompt only shows up reliably when training data combines intra-document repetition, moderate inconsistency, and skewed distributions all at once. That joint requirement is the central new claim here. The synthetic experiments isolate those three factors cleanly and demonstrate that none of them alone is enough; you need the full set for the model to switch sensibly between sources instead of defaulting to one. This combination is not laid out in the prior work referenced in the abstract, and the controlled setups give a direct test of necessity that feels useful for thinking about data design. They also track how post-training changes the arbitration patterns, which adds a practical layer.

The softer spot is the move from synthetics to real pretraining. The real-world section reports that similar arbitration behaviors appear in existing models and corpora, but it does not intervene by changing those three properties while holding other statistics fixed. Other co-varying factors like document length or topic spread could be responsible instead. The abstract also skips details on sample sizes and how analysis decisions were made, so the support for the main result is only partial.

This is for researchers who work on pretraining data curation and want empirical rules for making knowledge use more predictable in LLMs. Someone focused on reliability when facts conflict would find the property breakdowns worth reading. I would send it for peer review. The synthetic isolation is concrete enough to justify referee time, though the authors will need to tighten the causal story for natural data.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that large language models arbitrate between parametric knowledge (from pretraining) and in-context knowledge (at inference) based on internal confidence, and that the robust, balanced use of both sources is an emergent property requiring the co-occurrence of three training-data factors: intra-document repetition, moderate intra-document inconsistency, and skewed knowledge distribution. These conclusions are drawn from controlled experiments on synthetic corpora, with supporting observations from real-world pretraining corpora and analysis of how post-training procedures alter arbitration strategies.

Significance. If the central results hold, the work supplies concrete empirical guidance for constructing pretraining data that promotes reliable integration of parametric and in-context knowledge. The controlled synthetic design is a clear methodological asset, as it permits isolation of the three targeted data properties. The counterintuitive finding that repetition and moderate inconsistency can be beneficial rather than detrimental could influence future data-curation practices.

major comments (1)

[Real-world pretraining analysis] The section on real-world language model pretraining reports that similar arbitration patterns appear in existing pretrained models or corpora but does not perform interventions that manipulate intra-document repetition, inconsistency, or knowledge skew while holding other statistics (document length, topic diversity, token frequencies) fixed. Consequently the observational evidence cannot establish that the three synthetic factors are the causal drivers rather than artifacts of uncontrolled covariates.

minor comments (2)

[Abstract and experimental methods] The abstract and experimental sections provide no details on sample sizes, statistical controls, or criteria for post-hoc analysis choices, which limits assessment of the reliability of the reported patterns.
[Synthetic corpus construction] Precise operational definitions and quantitative metrics for 'intra-document repetition,' 'moderate inconsistency,' and 'skewed knowledge distribution' should be stated explicitly when describing the synthetic corpus construction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the manuscript's significance and methodology. We address the major comment below, clarifying the respective roles of our controlled experiments and observational analysis.

Referee: [Real-world pretraining analysis] The section on real-world language model pretraining reports that similar arbitration patterns appear in existing pretrained models or corpora but does not perform interventions that manipulate intra-document repetition, inconsistency, or knowledge skew while holding other statistics (document length, topic diversity, token frequencies) fixed. Consequently the observational evidence cannot establish that the three synthetic factors are the causal drivers rather than artifacts of uncontrolled covariates.

Authors: We agree that the real-world analysis is observational and does not involve interventions that manipulate the three factors while holding all other statistics fixed. Our causal claims rest on the controlled synthetic corpus experiments (Sections 3 and 4), where intra-document repetition, moderate inconsistency, and knowledge skew are varied independently with other variables held constant. The Section 5 analysis demonstrates that the same arbitration patterns emerge in existing pretrained models and natural corpora, providing evidence of ecological validity rather than independent causal proof. In revision we will add explicit language to Section 5 and the associated figure captions stating that this component is correlational and that causality is established via the synthetic controls.

Revision: partial

Circularity Check

0 steps flagged

No circularity in empirical experimental chain

The paper's claims rest on controlled variation of data properties (repetition, inconsistency, distribution skew) across synthetic corpora, with results reported as emergent from those manipulations. No equations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described methods. Real-world analysis is observational comparison rather than a derivation that reduces to the synthetic inputs by construction. The work is self-contained empirical science without load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that synthetic data manipulations isolate the true causal factors governing real pretraining dynamics.

axioms (1)

Domain assumption: Synthetic corpora manipulations isolate the causal effects of repetition, inconsistency, and skew on knowledge arbitration in a manner that generalizes to natural language pretraining.
Invoked to treat the controlled experiments as revealing fundamental training conditions rather than setup-specific artifacts.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Our experiments reveal that intra-document repetition of facts fosters the development of both parametric and in-context capabilities... a small degree of factual inconsistency... skewed frequency distribution... produce the desired arbitration pattern"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We train an 8-layer decoder-only Transformer... on a synthetic biographies corpus while systematically controlling various conditions"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation
cs.CL 2026-05 conditional novelty 6.0

A three-regime framework resolves contradictions in LLM context vs. parametric knowledge conflicts by distinguishing single-source updating, competitive integration, and task-appropriate selection, with empirical conf...

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1]

Physics of language models: Part 3.1, knowledge storage and extraction

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction, 2024a. URL https://arxiv.org/abs/2309.14316. Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipulation, 2024b. URL https://arxiv.org/abs/2309.14402. Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie B...

[2]

Language Models are Few-Shot Learners

URL https://arxiv.org/abs/2005.14165. Stephanie Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, and Felix Hill. Data distributional properties drive emergent in-context learning in transformers. Advances in neural information processing systems, 35:18878–18891,

[3]

Transformer Feed-Forward Layers Are Key-Value Memories

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913,

[4]

Dissecting recall of factual associations in auto-regressive language models

Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767,

[5]

Linearity of relation decoding in transformer language models

Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. arXiv preprint arXiv:2308.09124,

[6]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556,

[7]

Cutting off the head ends the conflict: A mechanism for interpreting and mitigating knowledge conflicts in language models

Zhuoran Jin, Pengfei Cao, Hongbang Yuan, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, and Jun Zhao. Cutting off the head ends the conflict: A mechanism for interpreting and mitigating knowledge conflicts in language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: A...

[8]

doi: 10.18653/v1/2024.findings-acl.70

Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.70. URL https://aclanthology.org/2024.findings-acl.70/. Asher Koriat. The self-consistency model of subjective confidence. Psychological Review, 119:80–113, 10

[9]

Brenden M

doi: 10.1037/a0025648. Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. The omniglot challenge: a 3-year progress report,

[10]

URL https://arxiv.org/abs/1902.03477. (Footnote 3: https://huggingface.co/docs/trl/index) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks,

[11]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

URL https://arxiv.org/abs/2005.11401. Gaotang Li, Yuzhong Chen, and Hanghang Tong. Taming knowledge conflicts in language models. In Forty-second International Conference on Machine Learning. Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of para...

[12]

Disentqa: Disentangling parametric and contextual knowledge with counterfactual question answering

Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, and Omri Abend. Disentqa: Disentangling parametric and contextual knowledge with counterfactual question answering. arXiv preprint arXiv:2211.05655,

[13]

In-context Learning and Induction Heads

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895,

[14]

doi: 10.18653/v1/2024.acl-long.458

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.458. URL https://aclanthology.org/2024.acl-long.458/. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9,

[15]

In-context retrieval-augmented language models

URL https://arxiv.org/abs/2302.00083. Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen tau Yih. Replug: Retrieval-augmented black-box language models,

[16]

REPLUG: Retrieval-Augmented Black-Box Language Models

URL https://arxiv.org/abs/2301.12652. Zhongxiang Sun, Xiaoxue Zang, Kai Zheng, Yang Song, Jun Xu, Xiao Zhang, Weijie Yu, Yang Song, and Han Li. Redeep: Detecting hallucination in retrieval-augmented generation via mechanistic interpretability,

[17]

Llama 2: Open foundation and fine-tuned chat models

URL https://arxiv.org/abs/2410.11414. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

[18]

URL https://arxiv.org/abs/2403.08319. Qinan Yu, Jack Merullo, and Ellie Pavlick. Characterizing mechanisms for factual recall in language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 9924–9959, Singapore, December

[19]

doi: 10.18653/v1/2023.emnlp-main.615

Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.615. URL https://aclanthology.org/2023.emnlp-main.615/. G.K. Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Martino Fine Books,

[20]

needle-in-the-haystack problem

ISBN 9781614273127. URL https://books.google.co.kr/books?id=nR06MAEACAAJ. Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, and Soham De. How do language models learn facts? dynamics, curricula and hallucinations. arXiv preprint arXiv:2503.21676,

[21]

Each profile contains four attributes: birth date, birth city, university, and major

A. SYNTHETIC BIOGRAPHIES DATASET CONSTRUCTION

Following prior work (Allen-Zhu & Li, 2024a; Zucchet et al., 2025), we first construct N synthetic person profiles. Each profile contains four attributes: birth date, birth city, university, and major. Names (first/middle/last) are sampled by randomly composing entries from a public name database. For birth date, we s...

[22]

Model architecture and training hyperparameters

B. DETAILS ON TRAINING LANGUAGE MODELS

Table 3: Model architecture.

| Component | Value |
| --- | --- |
| Embedding dimension | 512 |
| Layers | 8 |
| Attention heads | 8 |
| FFN inner dimension | 2048 |
| Context length | 512 |

Table 4: Training hyperparameters.

| Hyperparameter | Value |
| --- | --- |
| Max training steps | 16,000 |
| Batch size | 128 |
| Learning rate | 4×10⁻⁴ |
| Weight decay | 0.10 |
| LR scheduler | Cosine |
| Sequence length | 512 |
| Numeri... | |

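
For orientation, the values in Tables 3 and 4 can be written as a small configuration object with two back-of-envelope estimates. The class and function names are illustrative. The parameter count covers only attention and feed-forward weight matrices; embeddings, biases, and layer norms are left out because the vocabulary size is not stated on this page. The token figure is an upper bound that assumes every sequence in every batch is full length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Architecture and training values as listed in Tables 3 and 4."""

    d_model: int = 512          # embedding dimension
    n_layers: int = 8
    n_heads: int = 8
    d_ff: int = 2048            # FFN inner dimension
    context_length: int = 512
    max_steps: int = 16_000
    batch_size: int = 128
    seq_len: int = 512

    @property
    def head_dim(self) -> int:
        """Per-head width, assuming d_model is split evenly across heads."""
        return self.d_model // self.n_heads


def approx_block_params(cfg: ModelConfig) -> int:
    """Rough weight count for the Transformer blocks only.

    Attention: four d_model x d_model projections (Q, K, V, output).
    FFN: two matrices of shape d_model x d_ff.
    """
    attn = 4 * cfg.d_model ** 2
    ffn = 2 * cfg.d_model * cfg.d_ff
    return cfg.n_layers * (attn + ffn)


def max_tokens_seen(cfg: ModelConfig) -> int:
    """Upper bound on training tokens if every sequence is full length."""
    return cfg.max_steps * cfg.batch_size * cfg.seq_len
```

With the listed values this gives roughly 25 million block parameters and at most about one billion training tokens.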
[23]

(2022), we adopt the settings used in Zucchet et al

Following Hoffmann et al. (2022), we adopt the settings used in Zucchet et al. (2025). The training hyperparameters are listed in Table

[24]

November 10, 2079

(Footnote 6: https://huggingface.co/openai-community/gpt2)

C. EXAMPLE OF FACTUAL INCONSISTENCY NOISE WITHIN A DOCUMENT

Figure 9 illustrates a document from the REPEATED+MIX corpus in which factual inconsistency noise has been injected. The value highlighted in pink was injected as noise with some probability and therefore does not match the latter original value...

[25]

Roselee Justine Woolem first opened their eyes in Phoenix, AZ

Annika Klara Wickizer was educated in the field of Information Systems. Roselee Justine Woolem gained academic grounding in Business Analytics. Roselee Justine Woolem first opened their eyes in Phoenix, AZ. Roselee Justine Woolem studied at Hamilton College. Roselee Justine Woolem was brought into the world on August 12, 2083. Roselee Justine Woolem entered...

[26]

Roselee Justine Woolem began their life in Phoenix, AZ

Roselee Justine Woolem majored in Business Analytics. Roselee Justine Woolem began their life in Phoenix, AZ. Roselee Justine Woolem developed expertise at Hamilton College.

Figure 9: Example of the document injected inconsistency noise

D. ADDITIONAL EXPERIMENTAL RESULTS

We further examine the training dynamics by systematically varying several factors. Unle...

[1] [1]

Physics of language models: Part 3.1, knowledge storage and extraction

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction, 2024a. URLhttps://arxiv.org/abs/2309.14316. Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipulation, 2024b. URLhttps://arxiv.org/abs/2309.14402. Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie B...

work page arXiv

[2] [2]

Language Models are Few-Shot Learners

URL https://arxiv.org/abs/2005.14165. Stephanie Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, and Felix Hill. Data distributional properties drive emergent in-context learn- ing in transformers.Advances in neural information processing systems, 35:18878–18891,

work page internal anchor Pith review Pith/arXiv arXiv 2005

[3] [3]

Transformer Feed-Forward Layers Are Key-Value Memories

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories.arXiv preprint arXiv:2012.14913,

work page internal anchor Pith review Pith/arXiv arXiv 2012

[4] [4]

arXiv preprint arXiv:2304.14767 , year=

Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models.arXiv preprint arXiv:2304.14767,

work page arXiv

[5] [5]

Linearity of relation decoding in transformer language models

Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. arXiv preprint arXiv:2308.09124,

work page arXiv

[6] [6]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Train- ing compute-optimal large language models.arXiv preprint arXiv:2203.15556,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Cutting off the head ends the conflict: A mechanism for interpret- ing and mitigating knowledge conflicts in language models

Zhuoran Jin, Pengfei Cao, Hongbang Yuan, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, and Jun Zhao. Cutting off the head ends the conflict: A mechanism for interpret- ing and mitigating knowledge conflicts in language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.),Findings of the Association for Computational Linguistics: A...

work page 2024

[8] [8]

doi: 10.18653/v1/2024.findings-acl.70

Association for Computational Linguis- tics. doi: 10.18653/v1/2024.findings-acl.70. URLhttps://aclanthology.org/2024. findings-acl.70/. Asher Koriat. The self-consistency model of subjective confidence.Psychological Review, 119: 80–113, 10

work page doi:10.18653/v1/2024.findings-acl.70 2024

[9] [9]

Brenden M

doi: 10.1037/a0025648. Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. The omniglot challenge: a 3-year progress report,

work page doi:10.1037/a0025648

[10] [10]

URLhttps://arxiv.org/abs/1902.03477. 3https://huggingface.co/docs/trl/index 10 Preprint Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K ¨uttler, Mike Lewis, Wen tau Yih, Tim Rockt ¨aschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks,

[11]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

URL https://arxiv.org/abs/2005.11401.

Gaotang Li, Yuzhong Chen, and Hanghang Tong. Taming knowledge conflicts in language models. In Forty-second International Conference on Machine Learning.

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of para...

[12]

DisentQA: Disentangling parametric and contextual knowledge with counterfactual question answering. arXiv preprint arXiv:2211.05655,

Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, and Omri Abend. DisentQA: Disentangling parametric and contextual knowledge with counterfactual question answering. arXiv preprint arXiv:2211.05655,

[13]

In-context Learning and Induction Heads

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895,

[14]

doi: 10.18653/v1/2024.acl-long.458

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.458. URL https://aclanthology.org/2024.acl-long.458/.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9,

[15]

In-context retrieval-augmented language models

URL https://arxiv.org/abs/2302.00083.

Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. REPLUG: Retrieval-augmented black-box language models,

[16]

REPLUG: Retrieval-Augmented Black-Box Language Models

URL https://arxiv.org/abs/2301.12652.

Zhongxiang Sun, Xiaoxue Zang, Kai Zheng, Yang Song, Jun Xu, Xiao Zhang, Weijie Yu, Yang Song, and Han Li. ReDeEP: Detecting hallucination in retrieval-augmented generation via mechanistic interpretability,

[17]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al

URL https://arxiv.org/abs/2410.11414.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

[18]

URL https://arxiv.org/abs/2403.08319.

Qinan Yu, Jack Merullo, and Ellie Pavlick. Characterizing mechanisms for factual recall in language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 9924–9959, Singapore, December

[19]

doi: 10.18653/v1/2023.emnlp-main.615

Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.615. URL https://aclanthology.org/2023.emnlp-main.615/.

G.K. Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Martino Fine Books,

[20]

needle-in-the-haystack problem

ISBN 9781614273127. URL https://books.google.co.kr/books?id=nR06MAEACAAJ.

Nicolas Zucchet, Jörg Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, and Soham De. How do language models learn facts? Dynamics, curricula and hallucinations. arXiv preprint arXiv:2503.21676,

[21]

Each profile contains four attributes: birth date, birth city, university, and major

A SYNTHETIC BIOGRAPHIES DATASET CONSTRUCTION

Following prior work (Allen-Zhu & Li, 2024a; Zucchet et al., 2025), we first construct N synthetic person profiles. Each profile contains four attributes: birth date, birth city, university, and major. Names (first/middle/last) are sampled by randomly composing entries from a public name database.⁴ For birth date, we s...
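The profile construction described above can be sketched as follows. This is a minimal illustration, not the authors' generator: the name, city, university, and major pools are hypothetical placeholders (the paper draws names from a public name database that is not reproduced here), and because the birth-date sampling rule is cut off in the text, the uniform month/day/year draw and the `year_range` default are assumptions.

```python
import random
from dataclasses import dataclass

# Hypothetical placeholder pools; the paper's actual pools are not listed here.
FIRST_NAMES = ["Roselee", "Annika", "Evan"]
MIDDLE_NAMES = ["Justine", "Klara", "Mae"]
LAST_NAMES = ["Woolem", "Wickizer", "Hart"]
BIRTH_CITIES = ["Phoenix, AZ", "Austin, TX", "Boston, MA"]
UNIVERSITIES = ["Hamilton College", "Rice University"]
MAJORS = ["Business Analytics", "Information Systems"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


@dataclass(frozen=True)
class Profile:
    name: str
    birth_date: str
    birth_city: str
    university: str
    major: str


def sample_profiles(n, seed=0, year_range=(2000, 2099)):
    """Sample n profiles with unique first/middle/last name compositions."""
    capacity = len(FIRST_NAMES) * len(MIDDLE_NAMES) * len(LAST_NAMES)
    if n > capacity:
        raise ValueError(f"only {capacity} unique names are available")
    rng = random.Random(seed)
    profiles = {}
    while len(profiles) < n:
        # Compose a name from independent first/middle/last draws.
        name = " ".join(rng.choice(pool)
                        for pool in (FIRST_NAMES, MIDDLE_NAMES, LAST_NAMES))
        if name in profiles:
            continue  # resample on a name collision
        # Assumed rule: uniform month, day in 1..28, uniform year in year_range.
        birth_date = (f"{rng.choice(MONTHS)} {rng.randint(1, 28)}, "
                      f"{rng.randint(*year_range)}")
        profiles[name] = Profile(name, birth_date, rng.choice(BIRTH_CITIES),
                                 rng.choice(UNIVERSITIES), rng.choice(MAJORS))
    return list(profiles.values())
```

Calling `sample_profiles(10, seed=1)` returns ten `Profile` records with distinct names. Fixing the seed keeps the corpus reproducible across runs.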

[22]

B DETAILS ON TRAINING LANGUAGE MODELS

Table 3: Model architecture.

Component              Value
Embedding dimension    512
Layers                 8
Attention heads        8
FFN inner dimension    2048
Context length         512

Table 4: Training hyperparameters.

Hyperparameter         Value
Max training steps     16,000
Batch size             128
Learning rate          4 × 10^-4
Weight decay           0.10
LR scheduler           Cosine
Sequence length        512
Numeri...
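For quick reference, the values in Tables 3 and 4 can be collected into a configuration sketch together with two back-of-the-envelope quantities. The class and function names are hypothetical. The parameter estimate assumes a standard Transformer block (four d×d attention projections and a two-matrix FFN, ignoring biases, layer norms, and embeddings), and the token count assumes that the batch size is measured in sequences and that every sequence is filled to 512 tokens. Neither assumption is stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    # Values from Table 3.
    d_model: int = 512
    n_layers: int = 8
    n_heads: int = 8
    d_ff: int = 2048
    context_length: int = 512


@dataclass(frozen=True)
class TrainConfig:
    # Values from Table 4 (entries truncated in the source are omitted).
    max_steps: int = 16_000
    batch_size: int = 128
    learning_rate: float = 4e-4
    weight_decay: float = 0.10
    lr_scheduler: str = "cosine"
    seq_len: int = 512


def approx_block_params(cfg: ModelConfig) -> int:
    """Weight matrices only, assuming a standard (non-gated) Transformer block."""
    attention = 4 * cfg.d_model ** 2      # Q, K, V, and output projections
    ffn = 2 * cfg.d_model * cfg.d_ff      # up- and down-projection
    return cfg.n_layers * (attention + ffn)


def tokens_seen(cfg: TrainConfig) -> int:
    """Upper bound on training tokens, assuming fully packed sequences."""
    return cfg.max_steps * cfg.batch_size * cfg.seq_len


model_cfg, train_cfg = ModelConfig(), TrainConfig()
print(approx_block_params(model_cfg))  # 25165824
print(tokens_seen(train_cfg))          # 1048576000
```

Under these assumptions the Transformer blocks hold roughly 25M weights and training covers at most about 1.05B tokens.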

[23]

(2022), we adopt the settings used in Zucchet et al

Following Hoffmann et al. (2022), we adopt the settings used in Zucchet et al. (2025). The training hyperparameters are listed in Table
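As an illustration of the cosine learning-rate schedule named in the training hyperparameters, the sketch below uses the reported peak learning rate (4 × 10^-4) and step budget (16,000). The warmup length and the final learning rate are not given in this section, so `warmup_steps=0` and `min_lr=0.0` are assumptions, and the function name is hypothetical.

```python
import math


def cosine_lr(step, max_steps=16_000, peak_lr=4e-4, min_lr=0.0, warmup_steps=0):
    """Cosine decay from peak_lr to min_lr, with optional linear warmup."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warmup (disabled by default; an assumed option).
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)  # clamp outside the schedule
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 8_000, 16_000):
    print(step, cosine_lr(step))
```

With the defaults, the rate starts at the peak, reaches about half of it at the midpoint, and decays to zero at the final step.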

[24]

November 10, 2079

⁶ https://huggingface.co/openai-community/gpt2

C EXAMPLE OF FACTUAL INCONSISTENCY NOISE WITHIN A DOCUMENT

Figure 9 illustrates a document from the REPEATED+MIX corpus in which factual inconsistency noise has been injected. The value highlighted in pink was injected as noise with some probability and therefore does not match the later original value...
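One way to picture the noise injection is the sketch below. It is an illustration under stated assumptions rather than the authors' procedure: the text says only that a value is replaced "with some probability", so the independent per-mention draw, the uniform choice of the replacement, the function name `inject_inconsistency`, and the contents of `CITY_POOL` (apart from "Phoenix, AZ") are all hypothetical. The two templates paraphrase sentences from the Figure 9 example.

```python
import random

# Sentence templates modeled on the Figure 9 example.
TEMPLATES = [
    "Roselee Justine Woolem first opened their eyes in {value}.",
    "Roselee Justine Woolem began their life in {value}.",
]
# Hypothetical value pool; only "Phoenix, AZ" appears in the source.
CITY_POOL = ["Phoenix, AZ", "Austin, TX", "Boston, MA"]


def inject_inconsistency(templates, true_value, value_pool, noise_prob, rng):
    """Render repeated mentions of one fact, corrupting each with prob. noise_prob.

    Returns the rendered sentences and a parallel list of noise flags.
    """
    alternatives = [v for v in value_pool if v != true_value]
    sentences, flags = [], []
    for template in templates:
        # Assumed rule: each mention is corrupted independently.
        noisy = bool(alternatives) and rng.random() < noise_prob
        value = rng.choice(alternatives) if noisy else true_value
        sentences.append(template.format(value=value))
        flags.append(noisy)
    return sentences, flags


rng = random.Random(0)
doc, flags = inject_inconsistency(TEMPLATES, "Phoenix, AZ", CITY_POOL, 0.5, rng)
for sentence, noisy in zip(doc, flags):
    print("[noise]" if noisy else "[clean]", sentence)
```

Returning the flags alongside the sentences makes it possible to mark the injected values, as the pink highlighting does in Figure 9.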

[25]

Roselee Justine Woolem first opened their eyes in Phoenix, AZ

Annika Klara Wickizer was educated in the field of Information Systems. Roselee Justine Woolem gained academic grounding in Business Analytics. Roselee Justine Woolem first opened their eyes in Phoenix, AZ. Roselee Justine Woolem studied at Hamilton College. Roselee Justine Woolem was brought into the world on August 12, 2083. Roselee Justine Woolem entered...

[26]

Roselee Justine Woolem began their life in Phoenix, AZ

Roselee Justine Woolem majored in Business Analytics. Roselee Justine Woolem began their life in Phoenix, AZ. Roselee Justine Woolem developed expertise at Hamilton College.

Figure 9: Example of a document with injected inconsistency noise.

D ADDITIONAL EXPERIMENTAL RESULTS

We further examine the training dynamics by systematically varying several factors. Unle...
