Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

Iyiola E. Olatunji; Jonathan P. Shock; Matthew L. Smith; Samuel T. Segun; Tegawendé F. Bissyandé

arXiv: 2605.18732 · v1 · pith:ZQKSFTX2new · submitted 2026-05-18 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-20 10:55 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: factual recall, scaling laws, large language models, training data composition, confabulations, model size, topic frequency, sigmoid function

The pith

Factual recall in large language models improves systematically with larger model size and greater topic frequency in training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates 38 language models on more than 8,900 scholarly references using an automated checker. It finds that recall quality follows a sigmoid shape driven by the combined effect of model parameter count and how often the topic appears in training data. These two factors alone account for 60 percent of the variance across 16 dense models from four families, and 74 to 94 percent within individual families. This matters because it turns factual errors from seemingly random events into something that can be anticipated, and perhaps reduced, through choices in model scale or data composition.

Core claim

Recall quality follows a sigmoid in the log-linear combination of model parameter count and topic representation in training data. These two variables alone explain 60% of the variance across 16 dense models from four families, rising to 74-94% within individual families. The form matches a superposition-inspired account in which recall is gated by a signal-to-noise ratio: signal strength scales with concept frequency and the noise floor with model capacity.
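The functional form described above can be sketched in a few lines. This is a minimal illustration, not the paper's fitted model: the function name, the base-10 logarithms, and the three coefficients (`W_SIZE`, `W_TOPIC`, `BIAS`) are assumptions picked only so that the sigmoid shape is visible.

```python
import math


def recall_probability(n_params, topic_share, w_size, w_topic, bias):
    """Sigmoid of a log-linear combination of model size and topic share."""
    if n_params <= 0 or topic_share <= 0:
        raise ValueError("both inputs must be positive to take logarithms")
    # Log-linear predictor: weighted sum of the two log-scaled variables.
    z = w_size * math.log10(n_params) + w_topic * math.log10(topic_share) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, not values reported in the paper.
W_SIZE, W_TOPIC, BIAS = 1.5, 1.0, -12.0

for n_params in (1e9, 1e10, 1e11):
    for topic_share in (1e-4, 1e-2):
        p = recall_probability(n_params, topic_share, W_SIZE, W_TOPIC, BIAS)
        print(f"N={n_params:.0e} share={topic_share:.0e} -> recall ~ {p:.3f}")
```

Under this form, raising either the parameter count or the topic share moves the predictor in the same direction, which is why the two variables can trade off against each other.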

What carries the argument

A sigmoid function of the log-linear sum of model parameter count and topic representation in training data, which gates factual recall through a signal-to-noise ratio.

If this is right

Recall accuracy on specific facts becomes forecastable for new models without running them on every reference.
Models within one family exhibit even tighter scaling, allowing more precise predictions inside a given architecture.
Recall improves sharply once the combined size-frequency measure crosses a threshold, rather than improving gradually.
The signal-to-noise view implies that boosting either model capacity or topic frequency in data raises recall on that topic.
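The threshold behaviour in the list above can be made concrete by inverting the sigmoid. The sketch below assumes the same log-linear form as the core claim, with hypothetical coefficients that are not taken from the paper, and asks how many parameters a model would need to reach a target recall on a topic with a given share of the training data.

```python
import math


def params_for_target_recall(topic_share, target, w_size, w_topic, bias):
    """Parameter count at which the assumed sigmoid reaches `target` recall."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if topic_share <= 0 or w_size == 0:
        raise ValueError("topic_share must be positive and w_size non-zero")
    # Inverse of the sigmoid gives the required value of the predictor.
    logit = math.log(target / (1.0 - target))
    log10_n = (logit - bias - w_topic * math.log10(topic_share)) / w_size
    return 10.0 ** log10_n


# Hypothetical coefficients, chosen for illustration only.
W_SIZE, W_TOPIC, BIAS = 1.5, 1.0, -12.0

for share in (1e-2, 1e-3, 1e-5):
    n = params_for_target_recall(share, 0.5, W_SIZE, W_TOPIC, BIAS)
    print(f"topic share {share:.0e}: ~{n:.2e} parameters for 50% recall")
```

With positive weights, rarer topics push the required parameter count up, which is the size-frequency trade-off the list describes.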

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Data collection efforts could focus on increasing frequency for high-stakes topics to reduce errors in a targeted way.
Very large models may eventually achieve near-perfect recall on any topic given sufficient data exposure, while rare topics lag behind.
The same scaling pattern might appear in other forms of knowledge retrieval beyond scholarly references.

Load-bearing premise

That the amount of each topic in the training data can be measured accurately and that the automated system detects correct recall without systematic bias or error.

What would settle it

Measuring recall accuracy on new references and models and finding that the points do not follow the predicted sigmoid when plotted against the log-linear size-and-frequency score.
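One way to make this check concrete is to score held-out points with the coefficient of determination between the sigmoid's predictions and the measured recall. The sketch below is an assumption about how such a check could be run, not the paper's procedure, and the numbers are made up purely for illustration.

```python
def r_squared(observed, predicted):
    """Share of variance in `observed` explained by `predicted`."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Made-up recall rates for new models and the sigmoid's forecasts for them.
observed = [0.10, 0.35, 0.60, 0.85]
predicted = [0.15, 0.30, 0.65, 0.80]
print(f"held-out R^2 = {r_squared(observed, predicted):.3f}")
```

A value near zero or below on new models and references would count against the predicted form, while a value close to the reported in-sample figures would support it.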

Figures

Figures reproduced from arXiv: 2605.18732 by Iyiola E. Olatunji, Jonathan P. Shock, Matthew L. Smith, Samuel T. Segun, Tegawendé F. Bissyandé.

**Figure 2.** Factual recall quality scales log-linearly with model size across architectures.

**Figure 3.** Topic representation drives recall quality across model families.

**Figure 4.** Factual recall quality follows a sigmoid in the log-linear combination of model…

**Figure 5.** Larger models recall less-cited papers. Median citation count of correctly recalled references (SourceVerify status verified or verified-with-error) plotted against model size on a log–log scale, for 10 dense models from four families with n ≥ 50 matched references each. Error bars are bootstrap 95% CIs on the per-model median (10,000 resamples). The dashed line is a weighted log–log fit using the bootstra…

Original abstract

While scaling laws govern aggregate large language model performance, no scaling law has linked factual recall to both model size and training-data composition. We evaluated 38 models on over 8,900 scholarly references evaluated by an automated reference verification system. Recall quality follows a sigmoid in the log-linear combination of model parameter count and topic representation in training data. These two variables alone explain 60% of the variance across 16 dense models from four families, rising to 74-94% within individual families. The form matches a superposition-inspired account in which recall is gated by a signal-to-noise ratio: signal strength scales with concept frequency and the noise floor with model capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Factual recall on scholarly topics follows a sigmoid in model size plus topic frequency, explaining 60% variance overall, but the frequency estimates rest on unvalidated proxies for closed models.

The central finding is that factual recall on academic references tracks a sigmoid in the log-linear combination of parameter count and estimated topic frequency in training data. Across 16 dense models from four families this accounts for 60% of the variance, and within individual families the figure climbs to 74-94%. The authors tested 38 models in total on more than 8,900 references using an automated verifier, which is a solid scale for an empirical claim like this.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates factual recall across 38 LLMs on more than 8,900 scholarly references using an automated verification system. It reports that recall quality follows a sigmoid function of the log-linear combination of model parameter count and estimated topic representation in training data. These two variables explain 60% of the variance across 16 dense models from four families (rising to 74-94% within families). The functional form is interpreted as consistent with a superposition-inspired signal-to-noise account in which signal strength scales with concept frequency and noise floor with model capacity.

Significance. If the central empirical relation holds after addressing measurement concerns, the work would supply a rare scaling law that jointly incorporates model scale and training-data composition for a specific capability (factual recall), extending beyond aggregate performance laws. The scale of the evaluation (38 models, thousands of references) and the within-family consistency are strengths. The post-hoc invocation of superposition provides an intuitive framing but does not yet constitute a derivation.

major comments (3)

[§3 and §4] §3 (Methods) and §4 (Results): The procedure for quantifying topic representation relies on an external proxy for proprietary training corpora, yet no validation of the proxy's correlation with actual pre-training exposure is reported, nor are any sensitivity analyses with alternative proxies. Because the log-linear predictor is load-bearing for the 60% variance claim, weak or size-correlated proxy error would directly undermine the reported scaling relation.
[§4.2] §4.2 (sigmoid fit description): The automated reference verification system is treated as an unbiased measure of factual recall, but no calibration against human judgments, no accuracy rates, and no analysis of size- or topic-dependent detection biases are provided. This measurement assumption is central to all reported recall rates and variance figures.
[§5] §5 (Discussion): The superposition account is introduced after the empirical sigmoid is observed rather than used to derive the functional form a priori. Consequently the explanation depends on the same data used for the fit, reducing its independent predictive value.
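The proxy concern in the first major comment suggests a simple sensitivity check: compare how two candidate proxies rank the same topics. The sketch below computes a Spearman rank correlation with average ranks for ties. The topic counts are invented for illustration, and nothing here reflects the proxy the paper actually used.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant sequence has no defined rank correlation")
    return cov / math.sqrt(var_x * var_y)


# Invented per-topic counts from two hypothetical frequency proxies.
proxy_a = [120, 4500, 30, 980, 15000]
proxy_b = [200, 3900, 55, 40, 21000]
print(f"rank agreement between proxies: {spearman(proxy_a, proxy_b):.2f}")
```

High rank agreement between proxies would not validate either against the true training data, but low agreement would show that the fitted relation depends heavily on the proxy choice.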

minor comments (2)

[§4] The manuscript should report error bars or bootstrap confidence intervals on the sigmoid parameters and on the R² values.
[§4.1] Notation for the log-linear combination and the two free parameters of the sigmoid should be defined explicitly in the main text or an equation.
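The confidence intervals requested in the first minor comment could be produced with a percentile bootstrap, the same general technique the Figure 5 caption mentions for per-model medians. The sketch below is an assumption about one reasonable implementation, not the authors' code, and the citation counts are invented.

```python
import random
import statistics


def bootstrap_median_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the median of `values`."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = max(0, int(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return medians[lo_idx], medians[hi_idx]


# Invented citation counts for references one model recalled correctly.
citations = [12, 40, 55, 73, 90, 150, 210, 480, 900, 2300]
low, high = bootstrap_median_ci(citations)
print(f"median = {statistics.median(citations)}, 95% CI = ({low}, {high})")
```

The same resampling loop could wrap a sigmoid fit instead of a median to give intervals on the fitted parameters and on R², at the cost of refitting once per resample.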

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each of the major comments below and indicate where revisions will be made to strengthen the manuscript.

Referee: [§3 and §4] §3 (Methods) and §4 (Results): The procedure for quantifying topic representation relies on an external proxy for proprietary training corpora, yet no validation of the proxy's correlation with actual pre-training exposure is reported, nor are any sensitivity analyses with alternative proxies. Because the log-linear predictor is load-bearing for the 60% variance claim, weak or size-correlated proxy error would directly undermine the reported scaling relation.

Authors: We agree that additional validation of the proxy would strengthen the claims. Direct correlation with proprietary training data is not feasible because these corpora are not publicly available. However, we will add sensitivity analyses using alternative proxies for topic frequency, such as term frequencies in large public corpora like Common Crawl or Wikipedia, and report the robustness of the scaling relation under these alternatives. This will be incorporated into the revised §3 and §4. revision: yes

Referee: [§4.2] §4.2 (sigmoid fit description): The automated reference verification system is treated as an unbiased measure of factual recall, but no calibration against human judgments, no accuracy rates, and no analysis of size- or topic-dependent detection biases are provided. This measurement assumption is central to all reported recall rates and variance figures.

Authors: We acknowledge the importance of validating the automated system. We will include a new subsection detailing calibration on a randomly sampled subset of 200 references, where we compare the automated verification against human expert judgments. We will report agreement rates (e.g., Cohen's kappa) and analyze any systematic biases related to model size or topic. If biases are detected, we will discuss their potential impact on the results. This addresses the central measurement assumption. revision: yes

Referee: [§5] §5 (Discussion): The superposition account is introduced after the empirical sigmoid is observed rather than used to derive the functional form a priori. Consequently the explanation depends on the same data used for the fit, reducing its independent predictive value.

Authors: The referee correctly notes that the superposition-inspired account is a post-hoc interpretation. We will revise the Discussion to clarify that the sigmoid form was identified empirically from the data, and that the account provides an intuitive mechanistic framing rather than an a priori derivation. We will emphasize that this interpretation generates testable predictions for future experiments, such as interventions on training data composition, and acknowledge the limitations of post-hoc explanations. revision: partial
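The agreement statistic named in the second response, Cohen's kappa, corrects raw agreement between two raters for the agreement expected by chance. A minimal sketch follows. The label names and the sample judgments are hypothetical, and this is not the calibration the authors describe.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments on eight references: automated checker vs. human expert.
automated = ["verified", "verified", "error", "verified", "error", "error", "verified", "error"]
human = ["verified", "verified", "error", "error", "error", "error", "verified", "verified"]
print(f"kappa = {cohens_kappa(automated, human):.2f}")
```

Computing kappa separately per model-size bucket or per topic would be one way to look for the size- or topic-dependent bias the referee raises.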

Circularity Check

0 steps flagged

Empirical fit with no circular derivation chain

The paper reports an empirical scaling observation: recall quality is modeled as a sigmoid of the log-linear combination of model parameter count and topic frequency, with the fit explaining 60% variance across models. This is a post-hoc statistical description of observed data rather than a first-principles derivation whose functional form or result reduces to the inputs by construction. The superposition-inspired account is presented as matching the observed sigmoid after fitting, not as the source of a forced equation or uniqueness theorem. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described claims. The result is therefore self-contained as a data-driven correlation without circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The result rests on empirical measurement of topic frequency and recall quality; the sigmoid parameters are fitted rather than derived from first principles, and the superposition framing is post-hoc.

free parameters (1)

sigmoid midpoint and slope
Parameters of the sigmoid function are chosen to fit the observed recall data across models and topics.
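To show what fitting the midpoint and slope involves, here is a minimal sketch that picks the pair with the lowest squared error from a coarse grid. It assumes the two variables have already been combined into a single score `x`, uses synthetic noise-free data rather than anything from the paper, and substitutes a grid search for whatever optimizer the authors used.

```python
import math


def sigmoid(x, midpoint, slope):
    """Logistic curve that equals 0.5 at x == midpoint."""
    return 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))


def fit_sigmoid_grid(xs, ys, midpoints, slopes):
    """Return the (midpoint, slope) grid point with the lowest squared error."""
    best = None
    for m in midpoints:
        for k in slopes:
            sse = sum((sigmoid(x, m, k) - y) ** 2 for x, y in zip(xs, ys))
            if best is None or sse < best[0]:
                best = (sse, m, k)
    return best[1], best[2]


# Synthetic data generated from known parameters (midpoint 2.0, slope 1.5).
xs = [i * 0.5 for i in range(-4, 13)]             # combined score from -2.0 to 6.0
ys = [sigmoid(x, 2.0, 1.5) for x in xs]
midpoint_grid = [i * 0.25 for i in range(0, 17)]  # 0.0 to 4.0
slope_grid = [i * 0.25 for i in range(1, 13)]     # 0.25 to 3.0

print(fit_sigmoid_grid(xs, ys, midpoint_grid, slope_grid))  # prints (2.0, 1.5)
```

On real, noisy recall data a continuous optimizer would normally replace the grid, and the recovered values would carry uncertainty that the grid search alone does not report.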

axioms (1)

Domain assumption: The automated reference verification system accurately detects factual recall without systematic false positives or negatives.
This underpins the evaluation of over 8,900 scholarly references and the reported variance numbers.

