pith. sign in

arxiv: 2510.02370 · v3 · submitted 2025-09-29 · 💻 cs.CL · cs.AI

How Training Data Shapes the Use of Parametric and In-Context Knowledge in Language Models

Pith reviewed 2026-05-18 13:06 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords language modelsparametric knowledgein-context knowledgeknowledge arbitrationtraining datasynthetic corporapretraining
0
0 comments X

The pith

Balanced arbitration between parametric and in-context knowledge emerges only when training data combines intra-document repetition, moderate inconsistency, and skewed distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how pretraining data properties influence whether language models draw on their learned parametric knowledge or defer to information given in the prompt when the two conflict. Using controlled synthetic corpora, the experiments isolate three data traits that together produce reliable arbitration: repeated statements inside documents, a moderate amount of conflicting facts within documents, and uneven coverage of different facts across the corpus. These traits, usually minimized during data cleaning, turn out to be jointly necessary for models to show high-confidence use of parametric knowledge on familiar items and context use on less familiar ones. The authors further observe the same pattern in actual pretraining runs and track how later alignment steps modify the resulting behavior.

Core claim

When parametric and in-context knowledge conflict, models prefer parametric knowledge for high-confidence facts while deferring to context for less familiar ones. Controlled experiments with synthetic corpora show that this balanced arbitration arises as an emergent property precisely when the training data contains intra-document repetition, a moderate degree of intra-document inconsistency, and a skewed knowledge distribution. These three conditions, often viewed as detrimental, must occur together. The same dynamics appear in real-world language-model pretraining, and post-training procedures can reshape the arbitration strategies that result.

What carries the argument

Synthetic corpora that systematically vary intra-document repetition, intra-document inconsistency, and knowledge-distribution skew to measure resulting changes in model arbitration between parametric and in-context knowledge.

If this is right

  • Only the joint presence of repetition, moderate inconsistency, and skew produces robust balanced arbitration.
  • The same three data properties occur naturally during standard language-model pretraining.
  • Post-training procedures can shift the balance toward greater reliance on either parametric or in-context knowledge.
  • Training-data design should deliberately preserve moderate levels of these three properties to support reliable knowledge integration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Data pipelines that aggressively remove repetition and inconsistency may unintentionally weaken a model's ability to decide when to trust its own knowledge.
  • Controlled injection of moderate inconsistency during pretraining could be tested as a lightweight way to improve context utilization without adding parameters.
  • The finding suggests a possible link between data skew and reduced hallucination rates on high-confidence facts.
  • Scaling the same synthetic-corpus design to larger models would test whether the three-factor requirement holds beyond the studied regime.

Load-bearing premise

The specific data properties isolated in the synthetic-corpora experiments are the causal drivers of knowledge-arbitration behavior in large-scale real-world pretraining rather than artifacts of the controlled setup.

What would settle it

Train otherwise identical models on synthetic corpora that each omit one of the three factors and check whether balanced arbitration between parametric and in-context knowledge disappears in every case that lacks the full combination.

Figures

Figures reproduced from arXiv: 2510.02370 by Dong-Kyum Kim, Jea Kwon, Kyomin Jung, Meeyoung Cha, Minsung Kim, Nakyeong Yang.

Figure 1
Figure 1. Figure 1: Three knowledge utilization scenarios. Left: parametric knowledge utilization where the model recalls knowledge encoded in its parameters and answers queries about entities seen during training. Middle: in-context knowledge utilization where the model extracts and uses knowledge provided only in the prompt and is evaluated on novel entities not seen during training. Right: knowledge conflict resolution whe… view at source ↗
Figure 2
Figure 2. Figure 2: An example of intra-document repetition of key attributes (e.g., German, Physics) for [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Accuracy of parametric knowledge utilization (Acc [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: (a) Training dynamics of AccICKU, AccPKU, Pref ICK, and PrefPK when trained on the REPEATED+MIX corpus without noise (Left) and with 1% noise (Right). When the training corpus contains no noise (i.e., no inconsistent knowledge within the same documents), the model consis￾tently prefers in-context knowledge in knowledge conflicts, whereas even a small amount of noise induces a phase shift toward parametric … view at source ↗
Figure 5
Figure 5. Figure 5: AccPKU, Pref ICK, and PrefPK for the top 10% (high-frequency) and bottom 10% (low￾frequency) entities in the training corpus. For high-frequency entities, Pref ICK is initially higher but gradually yields to PrefPK; for low-frequency entities, Pref ICK remains consistently higher. p ∈ {1%, 5%, 10%}, replacing them with randomly sampled alternatives, and leave the later para￾graph unchanged ( [PITH_FULL_IM… view at source ↗
Figure 6
Figure 6. Figure 6: Bar plots of PrefPK under knowledge conflict (red) and mean entropy in the parametric￾knowledge-utilization setting (blue) Bins are ordered by Zipfian rank, where lower rank denotes higher frequency. Left: Results with zipfian training corpus without inconsistency noise. Right: Results with zipfian training corpus with a small amount(1%) of inconsistency noise. We further evaluated preference measures acro… view at source ↗
Figure 7
Figure 7. Figure 7: AccICKU, AccPKU, Pref ICK, and PrefPK in Pythia checkpoints. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: An example of the synthetic dataset. Each profile consists of four attributes ( [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Example of the document injected inconsistency noise [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Training dynamics of AccICKU and AccPKU under different numbers of training entities [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Training dynamics of AccICKU and AccPKU under different levels of intra-document inconsistency noise [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Training dynamics of AccICKU and AccPKU as a function of the Zipf exponent α [PITH_FULL_IMAGE:figures/full_fig_p015_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Three knowledge utilization scenarios in real-world large language models. [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗
read the original abstract

Large language models leverage both parametric knowledge acquired during pretraining and in-context knowledge provided at inference time. Crucially, when these sources conflict, models arbitrate based on their internal confidence, preferring parametric knowledge for high-confidence facts while deferring to context for less familiar ones. However, the training conditions that give rise to these fundamental behaviors remain unclear. Here we conduct controlled experiments using synthetic corpora to identify the specific data properties that shape knowledge utilization. Our results reveal a counterintuitive finding: the robust, balanced use of both knowledge sources is an emergent property that requires the co-occurrence of three factors typically considered detrimental, including (i) intra-document repetition, (ii) a moderate degree of intra-document inconsistency, and (iii) a skewed knowledge distribution. We further show that these dynamics arise in real-world language model pretraining and analyze how post-training procedures reshape arbitration strategies. Together, our findings provide empirical guidance for designing training data that supports the reliable integration of parametric and in-context knowledge in language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that large language models arbitrate between parametric knowledge (from pretraining) and in-context knowledge (at inference) based on internal confidence, and that the robust, balanced use of both sources is an emergent property requiring the co-occurrence of three training-data factors: intra-document repetition, moderate intra-document inconsistency, and skewed knowledge distribution. These conclusions are drawn from controlled experiments on synthetic corpora, with supporting observations from real-world pretraining corpora and analysis of how post-training procedures alter arbitration strategies.

Significance. If the central results hold, the work supplies concrete empirical guidance for constructing pretraining data that promotes reliable integration of parametric and in-context knowledge. The controlled synthetic design is a clear methodological asset, as it permits isolation of the three targeted data properties. The counterintuitive finding that repetition and moderate inconsistency can be beneficial rather than detrimental could influence future data-curation practices.

major comments (1)
  1. [Real-world pretraining analysis] The section on real-world language model pretraining reports that similar arbitration patterns appear in existing pretrained models or corpora but does not perform interventions that manipulate intra-document repetition, inconsistency, or knowledge skew while holding other statistics (document length, topic diversity, token frequencies) fixed. Consequently the observational evidence cannot establish that the three synthetic factors are the causal drivers rather than artifacts of uncontrolled covariates.
minor comments (2)
  1. [Abstract and experimental methods] The abstract and experimental sections provide no details on sample sizes, statistical controls, or criteria for post-hoc analysis choices, which limits assessment of the reliability of the reported patterns.
  2. [Synthetic corpus construction] Precise operational definitions and quantitative metrics for 'intra-document repetition,' 'moderate inconsistency,' and 'skewed knowledge distribution' should be stated explicitly when describing the synthetic corpus construction.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the manuscript's significance and methodology. We address the major comment below, clarifying the respective roles of our controlled experiments and observational analysis.

read point-by-point responses
  1. Referee: [Real-world pretraining analysis] The section on real-world language model pretraining reports that similar arbitration patterns appear in existing pretrained models or corpora but does not perform interventions that manipulate intra-document repetition, inconsistency, or knowledge skew while holding other statistics (document length, topic diversity, token frequencies) fixed. Consequently the observational evidence cannot establish that the three synthetic factors are the causal drivers rather than artifacts of uncontrolled covariates.

    Authors: We agree that the real-world analysis is observational and does not involve interventions that manipulate the three factors while holding all other statistics fixed. Our causal claims rest on the controlled synthetic corpus experiments (Sections 3 and 4), where intra-document repetition, moderate inconsistency, and knowledge skew are varied independently with other variables held constant. The Section 5 analysis demonstrates that the same arbitration patterns emerge in existing pretrained models and natural corpora, providing evidence of ecological validity rather than independent causal proof. In revision we will add explicit language to Section 5 and the associated figure captions stating that this component is correlational and that causality is established via the synthetic controls. revision: partial

Circularity Check

0 steps flagged

No circularity in empirical experimental chain

full rationale

The paper's claims rest on controlled variation of data properties (repetition, inconsistency, distribution skew) across synthetic corpora, with results reported as emergent from those manipulations. No equations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described methods. Real-world analysis is observational comparison rather than a derivation that reduces to the synthetic inputs by construction. The work is self-contained empirical science without load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that synthetic data manipulations isolate the true causal factors governing real pretraining dynamics.

axioms (1)
  • domain assumption Synthetic corpora manipulations isolate the causal effects of repetition, inconsistency, and skew on knowledge arbitration in a manner that generalizes to natural language pretraining.
    Invoked to treat the controlled experiments as revealing fundamental training conditions rather than setup-specific artifacts.

pith-pipeline@v0.9.0 · 5720 in / 1207 out tokens · 46500 ms · 2026-05-18T13:06:27.821165+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation

    cs.CL 2026-05 conditional novelty 6.0

    A three-regime framework resolves contradictions in LLM context vs. parametric knowledge conflicts by distinguishing single-source updating, competitive integration, and task-appropriate selection, with empirical conf...

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Physics of language models: Part 3.1, knowledge storage and extraction

    Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction, 2024a. URLhttps://arxiv.org/abs/2309.14316. Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipulation, 2024b. URLhttps://arxiv.org/abs/2309.14402. Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie B...

  2. [2]

    Language Models are Few-Shot Learners

    URL https://arxiv.org/abs/2005.14165. Stephanie Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, and Felix Hill. Data distributional properties drive emergent in-context learn- ing in transformers.Advances in neural information processing systems, 35:18878–18891,

  3. [3]

    Transformer Feed-Forward Layers Are Key-Value Memories

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories.arXiv preprint arXiv:2012.14913,

  4. [4]

    arXiv preprint arXiv:2304.14767 , year=

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models.arXiv preprint arXiv:2304.14767,

  5. [5]

    Linearity of relation decoding in transformer language models

    Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. arXiv preprint arXiv:2308.09124,

  6. [6]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Train- ing compute-optimal large language models.arXiv preprint arXiv:2203.15556,

  7. [7]

    Cutting off the head ends the conflict: A mechanism for interpret- ing and mitigating knowledge conflicts in language models

    Zhuoran Jin, Pengfei Cao, Hongbang Yuan, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, and Jun Zhao. Cutting off the head ends the conflict: A mechanism for interpret- ing and mitigating knowledge conflicts in language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.),Findings of the Association for Computational Linguistics: A...

  8. [8]

    doi: 10.18653/v1/2024.findings-acl.70

    Association for Computational Linguis- tics. doi: 10.18653/v1/2024.findings-acl.70. URLhttps://aclanthology.org/2024. findings-acl.70/. Asher Koriat. The self-consistency model of subjective confidence.Psychological Review, 119: 80–113, 10

  9. [9]

    Brenden M

    doi: 10.1037/a0025648. Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. The omniglot challenge: a 3-year progress report,

  10. [10]

    URLhttps://arxiv.org/abs/1902.03477. 3https://huggingface.co/docs/trl/index 10 Preprint Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K ¨uttler, Mike Lewis, Wen tau Yih, Tim Rockt ¨aschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks,

  11. [11]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    URLhttps: //arxiv.org/abs/2005.11401. Gaotang Li, Yuzhong Chen, and Hanghang Tong. Taming knowledge conflicts in language models. InF orty-second International Conference on Machine Learning. Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of para...

  12. [12]

    Dis- entqa: Disentangling parametric and contextual knowledge with counterfactual question answer- ing.arXiv preprint arXiv:2211.05655,

    Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, and Omri Abend. Dis- entqa: Disentangling parametric and contextual knowledge with counterfactual question answer- ing.arXiv preprint arXiv:2211.05655,

  13. [13]

    In-context Learning and Induction Heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads.arXiv preprint arXiv:2209.11895,

  14. [14]

    doi: 10.18653/v1/2024.acl-long.458

    Association for Computational Lin- guistics. doi: 10.18653/v1/2024.acl-long.458. URLhttps://aclanthology.org/2024. acl-long.458/. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9,

  15. [15]

    In-context retrieval-augmented language models

    URLhttps://arxiv. org/abs/2302.00083. Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettle- moyer, and Wen tau Yih. Replug: Retrieval-augmented black-box language models,

  16. [16]

    REPLUG: Retrieval-Augmented Black-Box Language Models

    URL https://arxiv.org/abs/2301.12652. Zhongxiang Sun, Xiaoxue Zang, Kai Zheng, Yang Song, Jun Xu, Xiao Zhang, Weijie Yu, Yang Song, and Han Li. Redeep: Detecting hallucination in retrieval-augmented generation via mechanistic interpretability,

  17. [17]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko- lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al

    URLhttps://arxiv.org/abs/2410.11414. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko- lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288,

  18. [18]

    URLhttps://arxiv.org/abs/2403. 08319. 11 Preprint Qinan Yu, Jack Merullo, and Ellie Pavlick. Characterizing mechanisms for factual recall in language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Processing, pp. 9924–9959, Singapore, Decem- ber

  19. [19]

    doi: 10.18653/v1/2023.emnlp-main.615

    Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.615. URLhttps://aclanthology.org/2023.emnlp-main.615/. G.K. Zipf.Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Martino Fine Books,

  20. [20]

    needle-in-the-haystack problem

    ISBN 9781614273127. URLhttps://books.google.co. kr/books?id=nR06MAEACAAJ. Nicolas Zucchet, J¨org Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, and Soham De. How do language models learn facts? dynamics, curricula and hallucinations.arXiv preprint arXiv:2503.21676,

  21. [21]

    Each profile contains four attributes:birth date,birth city, university, andmajor

    A SYNTHETICBIOGRAPHIESDATASETCONSTRUCTION Following prior work (Allen-Zhu & Li, 2024a; Zucchet et al., 2025), we first constructN synthetic person profiles. Each profile contains four attributes:birth date,birth city, university, andmajor. Names (first/middle/last) are sampled by randomly composing entries from a public name database.4 Forbirth date, we s...

  22. [22]

    Component Value Embedding dimension 512 Layers 8 Attention heads 8 FFN inner dimension 2048 Context length 512 Table 4: Training hyperparameters

    B DETAILS ONTRAININGLANGUAGEMODELS Table 3: Model architecture. Component Value Embedding dimension 512 Layers 8 Attention heads 8 FFN inner dimension 2048 Context length 512 Table 4: Training hyperparameters. Hyperparameter Value Max training steps 16,000 Batch size 128 Learning rate4×10 −4 Weight decay 0.10 LR scheduler Cosine Sequence length 512 Numeri...

  23. [23]

    (2022), we adopt the settings used in Zucchet et al

    Following Hoffmann et al. (2022), we adopt the settings used in Zucchet et al. (2025). The training hyperparameters are listed in Table

  24. [24]

    November 10, 2079

    6https://huggingface.co/openai-community/gpt2 13 Preprint C EXAMPLE OFFACTUALINCONSISTENCYNOISE WITHIN ADOCUMENT Figure 9 illustrates a document from the REPEATED+MIXcorpus in which factual inconsistency noise has been injected. The value highlighted in pink was injected as noise with some probability and therefore does not match the latter original value...

  25. [25]

    Roselee Justine Woolem first opened their eyes in Phoenix, AZ

    Annika Klara Wickizer was educated in the field of Information Systems.Roselee Justine Woolem gained academic grounding in Business Analytics. Roselee Justine Woolem first opened their eyes in Phoenix, AZ. Roselee Justine Woolem studied at Hamilton College. Roselee Justine Woolem was brought into the world on August 12, 2083.Roselee Justine Woolem entered...

  26. [26]

    Roselee Justine Woolem began their life in Phoenix, AZ

    Roselee Justine Woolem majored in Business Analytics. Roselee Justine Woolem began their life in Phoenix, AZ. Roselee Justine Woolem developed expertise at Hamilton College. Figure 9: Example of the document injected inconsistency noise D ADDITIONALEXPERIMENTALRESULTS We further examine the training dynamics by systematically varying several factors. Unle...