Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

Eran Ofek; Eyal Ben-David; Gal Yona; Nitay Calderon; Zorik Gekhman

arxiv: 2602.14080 · v2 · pith:VNBMUBUPnew · submitted 2026-02-15 · 💻 cs.CL · cs.AI


Nitay Calderon, Eyal Ben-David, Zorik Gekhman, Eran Ofek, Gal Yona

classification 💻 cs.CL cs.AI

keywords facts, failures, knowledge, recall, recalled, access, benchmark, bottleneck


Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95–98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.
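
The fact-level profiling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how encoding or recall are measured, so the three boolean inputs (`encoded`, `recalled_direct`, `recalled_thinking`) and the names `FactOutcome`, `Profile`, `profile_fact`, and `summarize` are all hypothetical. The sketch only shows how each fact would fall into one of the four categories the abstract names.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Profile(Enum):
    """The four fact-level categories named in the abstract."""
    NOT_ENCODED = "not encoded (empty shelf)"
    ENCODED_NOT_RECALLED = "encoded, cannot be recalled (lost key)"
    RECALLED_WITH_THINKING = "encoded, recalled only with thinking"
    DIRECTLY_RECALLED = "encoded, directly recalled"


@dataclass(frozen=True)
class FactOutcome:
    # Hypothetical per-fact measurements. How each flag is obtained is an
    # assumption of this sketch, not something the abstract specifies.
    encoded: bool            # the model shows evidence of storing the fact
    recalled_direct: bool    # answered correctly without thinking
    recalled_thinking: bool  # answered correctly with inference-time thinking


def profile_fact(outcome: FactOutcome) -> Profile:
    """Assign one fact to a category: encoding first, then accessibility."""
    if not outcome.encoded:
        return Profile.NOT_ENCODED
    if outcome.recalled_direct:
        return Profile.DIRECTLY_RECALLED
    if outcome.recalled_thinking:
        return Profile.RECALLED_WITH_THINKING
    return Profile.ENCODED_NOT_RECALLED


def summarize(outcomes: list[FactOutcome]) -> dict[Profile, float]:
    """Return the share of facts in each category (empty dict for no facts)."""
    counts = Counter(profile_fact(o) for o in outcomes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {p: counts[p] / total for p in Profile}


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    toy = [
        FactOutcome(encoded=True, recalled_direct=True, recalled_thinking=True),
        FactOutcome(encoded=True, recalled_direct=False, recalled_thinking=True),
        FactOutcome(encoded=True, recalled_direct=False, recalled_thinking=False),
        FactOutcome(encoded=False, recalled_direct=False, recalled_thinking=False),
    ]
    for profile, share in summarize(toy).items():
        print(f"{profile.value}: {share:.0%}")
```

Under this framing, the two middle categories are the recall failures: the fact is stored, but a plain question does not retrieve it.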


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LMs as Task-Specific Knowledge Bases: An Interpretability Analysis
cs.CL 2026-06 unverdicted novelty 6.0

LMs store facts in task-specific parameter subsets, shown by inconsistent emergence across tasks during training and distinct localized parameters for the same fact.

Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer
cs.CL 2026-05 unverdicted novelty 6.0

Larger LLMs hallucinate more often despite having the correct concept available because instruction tuning causes probability mass to disperse across alternative surface forms instead of concentrating on one.