arxiv: 2602.20122 · v2 · submitted 2026-02-23 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

NanoKnow: How to Know What Your Language Model Knows

Pith reviewed 2026-05-15 20:20 UTC · model grok-4.3

classification: 💻 cs.CL · cs.AI · cs.IR · cs.LG
keywords: language models, parametric knowledge, retrieval augmentation, benchmark dataset, pre-training data, closed-book QA, external context

The pith

A benchmark splits questions by whether their answers appear in a model's pre-training data to separate memorized facts from evidence-based answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces NanoKnow, a dataset that divides questions from standard benchmarks into groups depending on whether their answers were present in the pre-training corpus of the nanochat language models. This split allows experiments to measure how much models rely on internal parametric knowledge versus external context provided at inference time. The results indicate that closed-book performance tracks how often answers appeared in training, that adding evidence reduces but does not eliminate this dependence, and that irrelevant context hurts accuracy in ways that depend on its position and quantity. These patterns matter because they clarify when retrieval augmentation can or cannot substitute for training data.

Core claim

By releasing NanoKnow and testing eight nanochat checkpoints, the work shows three things: closed-book accuracy depends strongly on how often the answer appears in the pre-training data; external evidence mitigates this frequency effect, yet models are still more accurate when the answer was seen during pre-training, so parametric and external knowledge are complementary; and non-relevant contexts degrade accuracy depending on their number and position.

What carries the argument

The NanoKnow benchmark, which partitions Natural Questions and SQuAD questions into splits according to whether their answer strings appear in nanochat's open pre-training corpus.
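To make the partitioning idea concrete, here is a minimal sketch of a presence-based split over a toy in-memory corpus. It is not the NanoKnow pipeline: the names `QAItem`, `count_answer_docs`, and `split_by_presence` are hypothetical, the case-insensitive substring rule is an assumption, and the label "unsupported" is our own counterpart to the "supported" wording in the Figure 2 caption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QAItem:
    question: str
    answers: List[str]  # accepted gold answer strings


def count_answer_docs(item: QAItem, corpus: List[str]) -> int:
    """Number of corpus documents containing at least one gold answer string."""
    # Assumption: case-insensitive substring match; the paper's rule may differ.
    needles = [a.lower() for a in item.answers if a.strip()]
    return sum(any(n in doc.lower() for n in needles) for doc in corpus)


def split_by_presence(items: List[QAItem], corpus: List[str]) -> Dict[str, List[QAItem]]:
    """Partition items by whether any answer string occurs in the corpus."""
    splits: Dict[str, List[QAItem]] = {"supported": [], "unsupported": []}
    for item in items:
        key = "supported" if count_answer_docs(item, corpus) > 0 else "unsupported"
        splits[key].append(item)
    return splits


corpus = ["Paris is the capital of France.", "The Nile flows north."]
items = [
    QAItem("What is the capital of France?", ["Paris"]),
    QAItem("What is the capital of Peru?", ["Lima"]),
]
splits = split_by_presence(items, corpus)
```

A real corpus would not fit in a Python list; an inverted index or similar retrieval layer would be needed, which this sketch leaves out.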

If this is right

  • Closed-book accuracy rises with the frequency of the answer string in the pre-training data.
  • Adding relevant external evidence reduces dependence on pre-training frequency but does not remove it entirely.
  • Parametric and retrieved knowledge act as complements rather than substitutes.
  • Inserting non-relevant contexts lowers accuracy, and the size of the drop depends on both their position and their number.
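The first consequence is the easiest to check on one's own outputs. The sketch below groups per-question results by answer frequency and reports accuracy per group. The bucket boundaries and the function names `bucket` and `accuracy_by_bucket` are illustrative assumptions, not values taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def bucket(freq: int) -> str:
    """Map an answer frequency to a coarse bucket label (boundaries are assumed)."""
    if freq == 0:
        return "0"
    if freq < 10:
        return "1-9"
    if freq < 100:
        return "10-99"
    return "100+"


def accuracy_by_bucket(records: Iterable[Tuple[int, bool]]) -> Dict[str, float]:
    """records: (answer_frequency, is_correct) pairs, one per question."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [hits, count]
    for freq, correct in records:
        label = bucket(freq)
        totals[label][0] += int(correct)
        totals[label][1] += 1
    return {label: hits / count for label, (hits, count) in totals.items()}


records = [(0, False), (0, False), (5, True), (5, False), (250, True)]
result = accuracy_by_bucket(records)
```

Reporting the per-bucket question count alongside each accuracy would make small buckets visible, which matters for the referee's point about effect sizes.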

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of retrieval systems may need to filter out distractors more carefully than current pipelines do.
  • Similar splits could be applied to other open models to test whether the frequency and complementarity patterns generalize.
  • Future work could replace string matching with more precise probes of what the model actually memorized during training.

Load-bearing premise

That the exact presence of an answer string in the pre-training corpus reliably indicates whether the model has encoded the corresponding fact in its parameters.

What would settle it

An experiment that measures whether a model can correctly answer questions whose answers never appeared in pre-training even after many training epochs, or that checks accuracy on paraphrased answers not matching the exact string.
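A cheaper first check along the same lines is to see how sensitive the presence label is to the matching rule. The sketch below compares a verbatim match with a match after SQuAD-style answer normalization (lowercasing, dropping punctuation and articles, collapsing whitespace). The function names are hypothetical, and nothing here is claimed to be the rule NanoKnow uses.

```python
import re
import string
from typing import Dict


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def presence(answer: str, doc: str) -> Dict[str, bool]:
    """Report whether `answer` occurs in `doc` under two matching rules."""
    return {
        "exact": answer in doc,
        "normalized": normalize(answer) in normalize(doc),
    }


flags = presence("The Beatles", "fans of the beatles, mostly")
```

Questions whose label flips between the two rules are the ones where the string-presence proxy is least trustworthy. Plain substring matching can also fire inside longer words, so a token-boundary check would be a sensible refinement.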

Figures

Figures reproduced from arXiv: 2602.20122 by Jimmy Lin, Lingwei Gu, Nour Jedidi.

Figure 1: An example of NanoKnow on a question-answer pair. If any passage is deemed to answer the question after the string …
Figure 2: Distribution of NanoKnow’s supported questions
Figure 3: Influence of pre-training data answer frequency

Original abstract

How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a "black box" - unknown or inaccessible. The recent release of nanochat - a family of small LLMs with fully open pre-training data - addresses this as it provides a transparent view into where a model's parametric knowledge comes from. Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre-training corpus. Using these splits, we can now properly disentangle the sources of knowledge that LLMs rely on when producing an output. To demonstrate NanoKnow's utility, we conduct experiments using eight nanochat checkpoints. Our findings show: (1) closed-book accuracy is strongly influenced by answer frequency in the pre-training data, (2) providing external evidence can mitigate this frequency dependence, (3) even with external evidence, models are more accurate when answers were seen during pre-training, demonstrating that parametric and external knowledge are complementary, and (4) non-relevant information is harmful, with accuracy decreasing based on both the position and the number of non-relevant contexts. We release all NanoKnow artifacts at https://github.com/castorini/NanoKnow.
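Finding (4) varies two things: where the relevant passage sits among non-relevant ones, and how many non-relevant ones there are. A minimal sketch of how such inputs could be assembled follows. The helper names and the prompt template are hypothetical and are not taken from the NanoKnow repository.

```python
from typing import List


def build_contexts(relevant: str, distractors: List[str],
                   n_distractors: int, relevant_pos: int) -> List[str]:
    """Place `relevant` at index `relevant_pos` among the first n distractors."""
    if not 0 <= n_distractors <= len(distractors):
        raise ValueError("n_distractors out of range")
    if not 0 <= relevant_pos <= n_distractors:
        raise ValueError("relevant_pos out of range")
    contexts = list(distractors[:n_distractors])
    contexts.insert(relevant_pos, relevant)
    return contexts


def format_prompt(question: str, contexts: List[str]) -> str:
    """Render numbered contexts followed by the question (assumed template)."""
    blocks = [f"Context {i + 1}: {c}" for i, c in enumerate(contexts)]
    return "\n".join(blocks + [f"Question: {question}", "Answer:"])


contexts = build_contexts("R", ["d1", "d2", "d3"], n_distractors=2, relevant_pos=1)
prompt = format_prompt("Q?", contexts)
```

Sweeping `relevant_pos` from 0 to `n_distractors` for each value of `n_distractors` gives the full position-by-count grid.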

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces NanoKnow, a benchmark that partitions questions from Natural Questions and SQuAD according to whether their answers occur exactly in the pre-training corpus of the nanochat model family. Experiments across eight nanochat checkpoints are used to support four findings: closed-book accuracy is strongly modulated by answer frequency in pre-training data; external evidence mitigates this frequency effect; models remain more accurate with evidence when answers were seen during pre-training (indicating complementarity between parametric and external knowledge); and non-relevant contexts reduce accuracy in a manner dependent on both their position and count. All artifacts are released publicly.

Significance. If the presence-based splits provide a valid separation of parametric knowledge, the work supplies a reproducible, transparent method for studying knowledge sources in LLMs that is otherwise hindered by closed pre-training corpora. The consistent patterns observed across multiple checkpoints and the public release of the benchmark and code constitute clear strengths that would enable follow-on research on knowledge integration.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Benchmark Construction): The interpretation of findings (1) and (3) as direct evidence of parametric knowledge rests on the assumption that exact answer-string presence in the pre-training corpus is a sufficient proxy for the model having encoded the corresponding fact. This mapping is load-bearing yet the manuscript provides no discussion of potential confounds such as minimum exposure frequency, tokenization fragmentation, data duplication, or context-dependent memorization; if many 'present' instances are not actually encoded, the reported frequency dependence and residual advantage with evidence become harder to attribute specifically to parametric knowledge.
  2. [§4] §4 (Experiments): The reported accuracy differences between splits are described as consistent across checkpoints, but the manuscript does not include statistical significance tests (e.g., paired t-tests, bootstrap confidence intervals, or p-values) for the key contrasts. Without these, it is difficult to determine whether the observed gaps exceed what would be expected from sampling variability alone, weakening support for the frequency-dependence and complementarity claims.
minor comments (2)
  1. [Abstract] Abstract: Additional detail on split construction (exact matching procedure, any length or frequency filters applied, and the resulting split sizes) would improve reproducibility and allow readers to assess the proxy's coverage.
  2. [§4] Figure captions and §4: Ensure all plots explicitly label the 'present' vs. 'absent' conditions and report the number of examples per condition so that effect sizes can be interpreted in context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive feedback. We address the two major comments point by point below and will revise the manuscript accordingly.

  1. Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): The interpretation of findings (1) and (3) as direct evidence of parametric knowledge rests on the assumption that exact answer-string presence in the pre-training corpus is a sufficient proxy for the model having encoded the corresponding fact. This mapping is load-bearing yet the manuscript provides no discussion of potential confounds such as minimum exposure frequency, tokenization fragmentation, data duplication, or context-dependent memorization; if many 'present' instances are not actually encoded, the reported frequency dependence and residual advantage with evidence become harder to attribute specifically to parametric knowledge.

    Authors: We agree that exact string presence functions as a proxy rather than a direct guarantee of encoding. Although nanochat's fully open pre-training corpus permits precise detection of answer-string occurrences, we acknowledge that factors such as minimum frequency, tokenization effects, duplication, and context-dependent memorization are not addressed in the current draft. In the revised manuscript we will add a dedicated limitations paragraph in §3 that explicitly discusses these confounds and qualifies the interpretation of findings (1) and (3) as evidence conditioned on this proxy. revision: yes

  2. Referee: [§4] §4 (Experiments): The reported accuracy differences between splits are described as consistent across checkpoints, but the manuscript does not include statistical significance tests (e.g., paired t-tests, bootstrap confidence intervals, or p-values) for the key contrasts. Without these, it is difficult to determine whether the observed gaps exceed what would be expected from sampling variability alone, weakening support for the frequency-dependence and complementarity claims.

    Authors: We thank the referee for this observation. In the revised version we will report bootstrap confidence intervals (1,000 resamples) and paired t-tests (or Wilcoxon signed-rank tests where normality assumptions are violated) for the primary accuracy contrasts between presence-based splits, both within and across the eight checkpoints. These statistics will be added to §4 and the corresponding figures/tables. revision: yes
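For readers who want to run this kind of check on their own outputs, here is a minimal sketch of a percentile bootstrap interval for the accuracy gap between two splits, using the 1,000 resamples mentioned in the response. It treats the two splits as independent samples of 0/1 correctness flags; the function name and the fixed seed are assumptions, and this is not the authors' analysis code.

```python
import random
from typing import List, Tuple


def bootstrap_gap_ci(present: List[int], absent: List[int],
                     n_resamples: int = 1000, alpha: float = 0.05,
                     seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean(present) - mean(absent)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    gaps = []
    for _ in range(n_resamples):
        # Resample each split with replacement, keeping its original size.
        p = rng.choices(present, k=len(present))
        a = rng.choices(absent, k=len(absent))
        gaps.append(sum(p) / len(p) - sum(a) / len(a))
    gaps.sort()
    lower = gaps[round(alpha / 2 * n_resamples)]
    upper = gaps[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


present = [1] * 80 + [0] * 20   # toy data: 80% accuracy
absent = [1] * 40 + [0] * 60    # toy data: 40% accuracy
lower, upper = bootstrap_gap_ci(present, absent)
```

An interval that excludes zero suggests the gap is unlikely to be sampling noise. Contrasts for the same questions across checkpoints are paired, so they would need a paired resampling scheme instead.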

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark with direct measurements

The paper releases NanoKnow splits defined by exact answer-string presence in the nanochat pre-training corpus and reports measured accuracies for closed-book vs. evidence-augmented settings across eight checkpoints. No equations, derivations, or 'predictions' appear; the central claims are direct empirical observations on the constructed splits. The proxy (string presence) is an explicit methodological choice whose validity can be evaluated externally, but it does not create a self-referential loop or rename a fitted quantity as a prediction. No uniqueness theorems, ansatzes, or self-citation chains are invoked to force the results. The work is therefore self-contained as an empirical benchmark release.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that string presence in the open corpus accurately indicates parametric encoding; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: exact string occurrence of an answer in the nanochat pre-training corpus is a valid proxy for whether the model has memorized the corresponding fact.
    Used to create the presence/absence splits that drive all four findings.
