NanoKnow: How to Know What Your Language Model Knows
Pith reviewed 2026-05-15 20:20 UTC · model grok-4.3
The pith
A benchmark splits questions by whether their answers appear in a model's pre-training data to separate memorized facts from evidence-based answers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By releasing NanoKnow and testing eight nanochat checkpoints, the work shows three things: closed-book accuracy depends strongly on how often the answer appears in the pre-training data; external evidence mitigates this frequency effect, yet parametric knowledge remains complementary even when evidence is supplied; and non-relevant contexts degrade accuracy depending on their number and position.
What carries the argument
NanoKnow benchmark, which partitions Natural Questions and SQuAD items into splits according to the presence of answer strings in nanochat's open pre-training corpus.
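A minimal sketch of how a presence-based split like this could be computed, assuming a case-insensitive substring match over a small in-memory corpus. The function name `split_by_presence`, the record fields, and the lowercasing step are illustrative assumptions; the authors' actual matching procedure and corpus index are not described on this page.

```python
from typing import Dict, Iterable, List


def split_by_presence(
    questions: Iterable[Dict], corpus: List[str]
) -> Dict[str, List[Dict]]:
    """Partition QA items by whether any gold answer string occurs in the corpus.

    Hypothetical helper: lowercased substring matching is an assumption,
    not the procedure documented by the NanoKnow authors.
    """
    lowered_docs = [doc.lower() for doc in corpus]
    splits: Dict[str, List[Dict]] = {"present": [], "absent": []}
    for item in questions:
        # An item counts as "present" if any of its answers appears in any document.
        found = any(
            answer.lower() in doc
            for answer in item["answers"]
            for doc in lowered_docs
        )
        splits["present" if found else "absent"].append(item)
    return splits


# Toy data, invented for illustration only.
corpus = ["Paris is the capital of France.", "The Nile flows through Egypt."]
questions = [
    {"question": "What is the capital of France?", "answers": ["Paris"]},
    {"question": "Who wrote Dune?", "answers": ["Frank Herbert"]},
]
splits = split_by_presence(questions, corpus)
```

A linear scan like this would not scale to a real pre-training corpus; a search index would be needed in practice, which is outside the scope of this sketch.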
If this is right
- Closed-book accuracy rises with the frequency of the answer string in the pre-training data.
- Adding relevant external evidence reduces dependence on pre-training frequency but does not remove it entirely.
- Parametric and retrieved knowledge act as complements rather than substitutes.
- Inserting non-relevant contexts lowers accuracy, with the size of the drop depending on both their position and their number.
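The frequency claim in the list above can be checked by grouping questions into buckets by answer count and averaging correctness per bucket. The sketch below assumes each record is a pair of (answer count in the corpus, whether the model answered correctly); the bucket edges and the toy records are hypothetical and are not results from the paper.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy_by_frequency(
    records: List[Tuple[int, bool]], edges: List[int]
) -> Dict[str, float]:
    """Average correctness within answer-count buckets.

    `records` holds (answer_count_in_corpus, is_correct) pairs.
    `edges` are sorted bucket lower bounds and must start at 0.
    """
    tallies = defaultdict(lambda: [0, 0])  # label -> [num_correct, num_items]
    for count, correct in records:
        # Index of the largest lower bound that does not exceed the count.
        idx = bisect.bisect_right(edges, count) - 1
        upper = edges[idx + 1] if idx + 1 < len(edges) else "inf"
        label = f"[{edges[idx]}, {upper})"
        tallies[label][0] += int(correct)
        tallies[label][1] += 1
    return {label: hits / total for label, (hits, total) in tallies.items()}


# Toy records, invented for illustration only (not results from the paper).
records = [(0, False), (0, False), (3, True), (5, False), (120, True), (400, True)]
accuracy = accuracy_by_frequency(records, edges=[0, 1, 10, 100])
```

Empty buckets are simply omitted from the result, so a reader should report the number of items per bucket alongside each accuracy.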
Where Pith is reading between the lines
- Designers of retrieval systems may need to filter out distractors more carefully than current pipelines do.
- Similar splits could be applied to other open models to test whether the frequency and complementarity patterns generalize.
- Future work could replace string matching with more precise probes of what the model actually memorized during training.
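To probe the distractor effect raised in the first item above, one could vary where the relevant passage sits among non-relevant ones and how many of the latter are included. The following sketch builds such a prompt; the function name `build_prompt` and the prompt template are assumptions for illustration, not the format used in the paper.

```python
from typing import List


def build_prompt(
    question: str, relevant: str, distractors: List[str], position: int
) -> str:
    """Place the relevant passage at `position` among the distractors (0 = first)."""
    if not 0 <= position <= len(distractors):
        raise ValueError("position must be between 0 and len(distractors)")
    contexts = distractors[:position] + [relevant] + distractors[position:]
    numbered = [f"[{i + 1}] {text}" for i, text in enumerate(contexts)]
    return "\n".join(numbered + [f"Question: {question}", "Answer:"])


# Hypothetical passages; the relevant one is placed second of three.
prompt = build_prompt(
    question="What is the capital of France?",
    relevant="Paris is the capital of France.",
    distractors=["The Nile flows through Egypt.", "Mount Fuji is in Japan."],
    position=1,
)
```

Sweeping `position` from 0 to `len(distractors)` and varying the length of `distractors` gives a simple grid over the two factors the paper reports on.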
Load-bearing premise
That the exact presence of an answer string in the pre-training corpus reliably indicates whether the model has encoded the corresponding fact in its parameters.
What would settle it
Two experiments would help: one measuring whether a model can correctly answer questions whose answers never appeared in pre-training, even after many training epochs; and one checking accuracy on paraphrased answers that do not match the exact string.
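As a first step toward the second check, one could compare strict string equality against a normalized comparison. The sketch below uses the lowercase, strip-punctuation, drop-articles normalization familiar from SQuAD-style evaluation; it only catches surface variants, not true paraphrases, and the function names are hypothetical.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_vs_normalized(prediction: str, gold: str) -> dict:
    """Report whether two answers match strictly and after normalization."""
    return {
        "exact": prediction == gold,
        "normalized": normalize_answer(prediction) == normalize_answer(gold),
    }


# A surface variant that strict matching would miss.
result = exact_vs_normalized("The Eiffel Tower.", "Eiffel Tower")
```

Items where the two flags disagree are the ones most likely to be misclassified by an exact-string proxy.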
Original abstract
How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a "black box" - unknown or inaccessible. The recent release of nanochat - a family of small LLMs with fully open pre-training data - addresses this as it provides a transparent view into where a model's parametric knowledge comes from. Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre-training corpus. Using these splits, we can now properly disentangle the sources of knowledge that LLMs rely on when producing an output. To demonstrate NanoKnow's utility, we conduct experiments using eight nanochat checkpoints. Our findings show: (1) closed-book accuracy is strongly influenced by answer frequency in the pre-training data, (2) providing external evidence can mitigate this frequency dependence, (3) even with external evidence, models are more accurate when answers were seen during pre-training, demonstrating that parametric and external knowledge are complementary, and (4) non-relevant information is harmful, with accuracy decreasing based on both the position and the number of non-relevant contexts. We release all NanoKnow artifacts at https://github.com/castorini/NanoKnow.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces NanoKnow, a benchmark that partitions questions from Natural Questions and SQuAD according to whether their answers occur exactly in the pre-training corpus of the nanochat model family. Experiments across eight nanochat checkpoints are used to support four findings: closed-book accuracy is strongly modulated by answer frequency in pre-training data; external evidence mitigates this frequency effect; models remain more accurate with evidence when answers were seen during pre-training (indicating complementarity between parametric and external knowledge); and non-relevant contexts reduce accuracy in a manner dependent on both their position and count. All artifacts are released publicly.
Significance. If the presence-based splits provide a valid separation of parametric knowledge, the work supplies a reproducible, transparent method for studying knowledge sources in LLMs that is otherwise hindered by closed pre-training corpora. The consistent patterns observed across multiple checkpoints and the public release of the benchmark and code constitute clear strengths that would enable follow-on research on knowledge integration.
major comments (2)
- [Abstract and §3] Abstract and §3 (Benchmark Construction): The interpretation of findings (1) and (3) as direct evidence of parametric knowledge rests on the assumption that exact answer-string presence in the pre-training corpus is a sufficient proxy for the model having encoded the corresponding fact. This mapping is load-bearing yet the manuscript provides no discussion of potential confounds such as minimum exposure frequency, tokenization fragmentation, data duplication, or context-dependent memorization; if many 'present' instances are not actually encoded, the reported frequency dependence and residual advantage with evidence become harder to attribute specifically to parametric knowledge.
- [§4] §4 (Experiments): The reported accuracy differences between splits are described as consistent across checkpoints, but the manuscript does not include statistical significance tests (e.g., paired t-tests, bootstrap confidence intervals, or p-values) for the key contrasts. Without these, it is difficult to determine whether the observed gaps exceed what would be expected from sampling variability alone, weakening support for the frequency-dependence and complementarity claims.
minor comments (2)
- [Abstract] Abstract: Additional detail on split construction (exact matching procedure, any length or frequency filters applied, and the resulting split sizes) would improve reproducibility and allow readers to assess the proxy's coverage.
- [§4] Figure captions and §4: Ensure all plots explicitly label the 'present' vs. 'absent' conditions and report the number of examples per condition so that effect sizes can be interpreted in context.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive feedback. We address the two major comments point by point below and will revise the manuscript accordingly.
-
Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): The interpretation of findings (1) and (3) as direct evidence of parametric knowledge rests on the assumption that exact answer-string presence in the pre-training corpus is a sufficient proxy for the model having encoded the corresponding fact. This mapping is load-bearing yet the manuscript provides no discussion of potential confounds such as minimum exposure frequency, tokenization fragmentation, data duplication, or context-dependent memorization; if many 'present' instances are not actually encoded, the reported frequency dependence and residual advantage with evidence become harder to attribute specifically to parametric knowledge.
Authors: We agree that exact string presence functions as a proxy rather than a direct guarantee of encoding. Although nanochat's fully open pre-training corpus permits precise detection of answer-string occurrences, we acknowledge that factors such as minimum frequency, tokenization effects, duplication, and context-dependent memorization are not addressed in the current draft. In the revised manuscript we will add a dedicated limitations paragraph in §3 that explicitly discusses these confounds and qualifies the interpretation of findings (1) and (3) as evidence conditioned on this proxy. revision: yes
-
Referee: [§4] §4 (Experiments): The reported accuracy differences between splits are described as consistent across checkpoints, but the manuscript does not include statistical significance tests (e.g., paired t-tests, bootstrap confidence intervals, or p-values) for the key contrasts. Without these, it is difficult to determine whether the observed gaps exceed what would be expected from sampling variability alone, weakening support for the frequency-dependence and complementarity claims.
Authors: We thank the referee for this observation. In the revised version we will report bootstrap confidence intervals (1,000 resamples) and paired t-tests (or Wilcoxon signed-rank tests where normality assumptions are violated) for the primary accuracy contrasts between presence-based splits, both within and across the eight checkpoints. These statistics will be added to §4 and the corresponding figures/tables. revision: yes
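A minimal sketch of the kind of bootstrap interval described in this response, assuming per-question correctness is stored as 0/1 values for each split. It uses a percentile bootstrap with independent resampling of the two splits (they contain different questions); the function name, the toy data, and the choice of the percentile method are assumptions, not the authors' stated procedure.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_gap(
    present: List[int],
    absent: List[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed gap, lower bound, upper bound) for present minus absent accuracy."""
    rng = random.Random(seed)

    def mean(values: List[int]) -> float:
        return sum(values) / len(values)

    observed = mean(present) - mean(absent)
    gaps = []
    for _ in range(n_resamples):
        # Resample each split with replacement, keeping its original size.
        resampled_present = rng.choices(present, k=len(present))
        resampled_absent = rng.choices(absent, k=len(absent))
        gaps.append(mean(resampled_present) - mean(resampled_absent))
    gaps.sort()
    lower_idx = int(n_resamples * alpha / 2)
    upper_idx = n_resamples - 1 - lower_idx
    return observed, gaps[lower_idx], gaps[upper_idx]


# Toy correctness vectors, invented for illustration (not results from the paper).
present = [1] * 80 + [0] * 20
absent = [1] * 50 + [0] * 50
observed, lower, upper = bootstrap_accuracy_gap(present, absent)
```

If the resulting interval excludes zero, the gap between splits is unlikely to be explained by sampling variability alone under this resampling scheme.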
Circularity Check
No significant circularity: empirical benchmark with direct measurements
The paper releases NanoKnow splits defined by exact answer-string presence in the nanochat pre-training corpus and reports measured accuracies for closed-book vs. evidence-augmented settings across eight checkpoints. No equations, derivations, or 'predictions' appear; the central claims are direct empirical observations on the constructed splits. The proxy (string presence) is an explicit methodological choice whose validity can be evaluated externally, but it does not create a self-referential loop or rename a fitted quantity as a prediction. No uniqueness theorems, ansatzes, or self-citation chains are invoked to force the results. The work is therefore self-contained as an empirical benchmark release.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Exact string occurrence of an answer in the nanochat pre-training corpus is a valid proxy for whether the model has memorized the corresponding fact.