The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning

Gintare Karolina Dziugaite; Jonathan Ragan-Kelley; Michael Carbin; Nolan Clement; Tian Jin; Vaishnavh Nagarajan; Xin Dong

arxiv: 2310.04680 · v1 · pith:LUS63OLPnew · submitted 2023-10-07 · cs.CL · cs.AI · cs.LG

Tian Jin, Nolan Clement, Xin Dong, Vaishnavh Nagarajan, Michael Carbin, Jonathan Ragan-Kelley, Gintare Karolina Dziugaite

classification cs.CL · cs.AI · cs.LG

keywords scaling · in-context · model · capabilities · fact · learning · recall · core

How does scaling the number of parameters in large language models (LLMs) affect their core capabilities? We study two natural scaling techniques, weight pruning and simply training a smaller or larger model (which we refer to as dense scaling), and their effects on two core capabilities of LLMs: (a) recalling facts presented during pre-training and (b) processing information presented in-context during inference. By curating a suite of tasks that help disentangle these two capabilities, we find a striking difference in how these two abilities evolve due to scaling. Reducing the model size by more than 30% (via either scaling approach) significantly decreases the ability to recall facts seen in pre-training. Yet, a 60-70% reduction largely preserves the various ways the model can process in-context information, ranging from retrieving answers from a long context to learning parameterized functions from in-context exemplars. The fact that both dense scaling and weight pruning exhibit this behavior suggests that scaling model size has an inherently disparate effect on fact recall and in-context learning.
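
To make "reducing the model size by 30%" versus "60-70%" concrete for the weight-pruning case, here is a minimal sketch of unstructured magnitude pruning on a toy weight vector. The abstract does not name the pruning algorithm used in the paper, so magnitude pruning is an assumption chosen for simplicity, and the helper names `magnitude_prune` and `sparsity_of` as well as the toy weights are hypothetical. The sketch only shows what the two sparsity levels mean mechanically; it says nothing about fact recall or in-context accuracy.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries zeroed.

    `sparsity` is the fraction of entries to remove (0.3 means 30%).
    Assumption: plain magnitude pruning, used here only as a stand-in.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)  # number of entries to zero (floored)
    # Indices ordered by ascending absolute value; the first n_prune are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def sparsity_of(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Toy weights (hypothetical values, not taken from any model).
weights = [0.9, -0.05, 0.4, -0.7, 0.02, 0.3, -0.15, 0.6, -0.01, 0.25]

light = magnitude_prune(weights, 0.3)  # the "more than 30%" regime starts here
heavy = magnitude_prune(weights, 0.7)  # upper end of the "60-70%" regime

print(sparsity_of(light), light)
print(sparsity_of(heavy), heavy)
```

Dense scaling, the other technique in the abstract, has no analogous one-function sketch: it means training a separate smaller or larger model from scratch rather than removing weights from an existing one.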

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Don't Go Breaking My LLM: The Impact of Pruning Attention Layers on Explanation Faithfulness and Confidence Calibration
cs.LG 2026-06 unverdicted novelty 6.0

Pruning attention layers in five LLMs across eight datasets maintains accuracy but degrades faithfulness and calibration.