pith. machine review for the scientific record.

arxiv: 2605.03229 · v1 · submitted 2026-05-04 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Sparse Memory Finetuning as a Low-Forgetting Alternative to LoRA and Full Finetuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:47 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords Sparse Memory Finetuning · catastrophic forgetting · key-value memory layers · MedMCQA · language model adaptation · LoRA comparison · task-specific finetuning

The pith

Sparse Memory Finetuning updates only the most heavily read rows in added key-value layers to gain task performance while preserving general capabilities better than LoRA or full finetuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sparse Memory Finetuning to address catastrophic forgetting when adapting pretrained language models to new tasks. It inserts key-value memory layers and, at each training step, modifies only the small subset of rows that the current data batch reads most. Tests on a 0.5B model using a medical exam task show a 2.5 point accuracy rise while two general-knowledge probes stay within one point of the untouched base model. Standard methods such as LoRA and full finetuning deliver bigger task gains but produce clear drops on the same probes. Two row-selection criteria are compared to show how they trade off the two forgetting measures differently.

Core claim

Sparse Memory Finetuning works by adding key-value memory layers to a pretrained model and updating only the rows most heavily read by each training batch. On the MedMCQA medical multiple-choice task this produces a 2.5 percentage point gain while WikiText perplexity and TriviaQA accuracy remain within roughly one point of the base model. LoRA and full finetuning reach larger task improvements but cause noticeable drift on both probes. KL-divergence and TF-IDF row-selection rules balance the forgetting metrics in different ways.
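The per-step mechanics described above can be sketched in a few lines. This is an editorial illustration at toy scale, not the authors' implementation; every name, shape, and the aggregation rule are assumptions.

```python
# Hedged sketch of one Sparse Memory Finetuning step: a key-value memory is
# read via top-k key similarity, and only the rows the batch reads most
# heavily receive an update. Toy shapes; not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
n_rows, d = 64, 8                 # memory size and hidden width (toy scale)
keys = rng.normal(size=(n_rows, d))
values = np.zeros((n_rows, d))    # value rows are the trainable memory

def memory_forward(x, k=4):
    """Read the k most similar rows per query; return output and read counts."""
    scores = x @ keys.T                        # (batch, n_rows)
    top = np.argsort(-scores, axis=1)[:, :k]   # rows each query reads
    counts = np.bincount(top.ravel(), minlength=n_rows)
    out = values[top].mean(axis=1)             # simple aggregation of read rows
    return out, counts

def sparse_update(x, target, n_update=4, lr=0.1):
    """Update only the n_update most heavily read rows for this batch."""
    out, counts = memory_forward(x)
    hot = np.argsort(-counts)[:n_update]       # most-read rows this batch
    err = (target - out).mean(axis=0)          # crude shared error signal
    values[hot] += lr * err                    # every other row stays frozen
    return hot

x = rng.normal(size=(16, d))
target = rng.normal(size=(16, d))
touched = sparse_update(x, target)
frozen = np.setdiff1d(np.arange(n_rows), touched)
assert np.allclose(values[frozen], 0.0)        # untouched rows never moved
```

The frozen-row assertion is the whole point of the design: whatever the update does to the hot rows, the rest of the memory, and hence whatever it encodes, is untouched by construction.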

What carries the argument

Key-value memory layers updated selectively by row-selection rules (KL-divergence or TF-IDF) that identify and modify only the most heavily read rows per batch.
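The review names the two selection rules without spelling out their formulas. The sketch below is one plausible reading of each, scoring per-row read counts; both scorers are editorial assumptions, not the authors' exact criteria.

```python
# Hypothetical row-selection scorers over per-row read counts. KL favors rows
# this batch reads far more than the background distribution; TF-IDF favors
# rows distinctive to this batch (read here, rarely read elsewhere).
import numpy as np

def kl_scores(batch_counts, background_counts, eps=1e-9):
    """Per-row contribution to KL(batch || background) over read distributions."""
    p = batch_counts / max(batch_counts.sum(), 1)
    q = (background_counts + eps) / (background_counts.sum() + eps * len(background_counts))
    return np.where(p > 0, p * np.log((p + eps) / q), 0.0)

def tfidf_scores(batch_counts, doc_freq, n_batches_seen):
    """TF-IDF over memory rows: term frequency = reads in this batch,
    document frequency = number of past batches that read the row."""
    tf = batch_counts / max(batch_counts.sum(), 1)
    idf = np.log((1 + n_batches_seen) / (1 + doc_freq))
    return tf * idf

batch = np.array([10, 0, 1, 0])           # reads by the current batch
background = np.array([100, 100, 5, 50])  # cumulative reads over earlier data
df = np.array([9, 10, 1, 8])              # of 10 past batches, how many read each row

top_kl = int(np.argmax(kl_scores(batch, background)))       # picks row 0
top_tfidf = int(np.argmax(tfidf_scores(batch, df, 10)))     # picks row 2
```

Note that the two rules can disagree: KL keeps following the heavily read row, while TF-IDF elevates the rarely touched one, which is one mechanical way the two could trade off the forgetting probes differently.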

If this is right

  • Task-specific gains become possible through limited changes to memory rows alone.
  • General language modeling and trivia performance can stay close to the base model after adaptation.
  • Different row-selection rules produce distinct trade-offs between the two forgetting probes.
  • The method offers a middle path when larger updates cause excessive capability drift.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the selective-update rule scales, repeated task additions could be made with lower risk of cumulative forgetting.
  • The approach might combine with other parameter-efficient methods to further reduce interference.
  • Testing on sequential tasks or larger base models would check whether the stability advantage persists beyond the reported single-task case.
  • Row selection based on data similarity could generalize to other memory-augmented model designs.

Load-bearing premise

Selectively updating only the most heavily read memory rows is enough to acquire new task knowledge without interfering with unrelated general capabilities.

What would settle it

Replicating the same MedMCQA training under SMF and observing a drop of more than one point in TriviaQA accuracy or WikiText perplexity would show the low-forgetting result does not hold.
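That test can be phrased as a simple drift check against the frozen base model. The metric values below are hypothetical placeholders, not numbers from the paper.

```python
# Minimal sketch of the falsification check: every forgetting probe must stay
# within one point of the base model. All figures here are invented.
def forgetting_holds(base, finetuned, tolerance=1.0):
    """True iff every probe metric stays within `tolerance` of the base model."""
    return all(abs(finetuned[k] - base[k]) <= tolerance for k in base)

base = {"wikitext_ppl": 20.0, "triviaqa_acc": 30.0}
smf = {"wikitext_ppl": 20.6, "triviaqa_acc": 29.5}      # hypothetical SMF probes
full_ft = {"wikitext_ppl": 24.1, "triviaqa_acc": 25.8}  # hypothetical full FT

assert forgetting_holds(base, smf)          # inside the 1-point band
assert not forgetting_holds(base, full_ft)  # drift like this would falsify it
```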

Figures

Figures reproduced from arXiv: 2605.03229 by Anirudh Kanchi, Garv Shah, Prakhar Gupta, Satyam Goyal.

Figure 1: Method overview. Each Qwen-2.5 transformer block uses RMSNorm, grouped-query self-attention, and a SwiGLU MLP: down_proj(silu(gate_proj(x)) ⊙ up_proj(x)). We compare two ways of inserting a Hashing Memory Layer at selected layers: Replacement substitutes the MLP entirely (left); Additive keeps the MLP and adds a memory-scaled branch (right). The middle panel details the memory layer (Section 3.1): queries a… view at source ↗
Figure 2: Plasticity–stability frontier on MedMCQA. Each method appears as one point with cross-seed standard-deviation error bars. Color encodes the method family. For sparse methods, marker shape distinguishes the slot-selection rule (circle = KL, triangle = TF-IDF). Non-sparse baselines (Base Qwen, LoRA, Full finetune) appear as their own labeled points. Top-left of the left panel and top-right of the right panel… view at source ↗
Original abstract

Adapting a pretrained language model to a new task often hurts the general capabilities it already had, a problem known as catastrophic forgetting. Sparse Memory Finetuning (SMF) tries to avoid this by adding key-value memory layers to the model and, on each training step, updating only the small set of memory rows that the current batch reads most heavily. We re-implement SMF on Qwen-2.5-0.5B-Instruct and compare it with LoRA and full finetuning on MedMCQA, a 4-choice medical exam task, using WikiText perplexity and TriviaQA accuracy as forgetting probes. SMF improves MedMCQA by 2.5 percentage points while keeping both forgetting probes within roughly 1 point of the base model, whereas LoRA and full finetuning achieve larger gains but with clear drift on both. We also compare two row-selection rules (KL-divergence and TF-IDF), which balance the two forgetting metrics differently.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Sparse Memory Finetuning (SMF), which augments a pretrained LLM with key-value memory layers and, on each training step, updates only the small subset of memory rows most heavily read by the current batch (selected via KL-divergence or TF-IDF). On MedMCQA using Qwen-2.5-0.5B-Instruct, SMF yields a 2.5 pp accuracy gain while holding WikiText perplexity and TriviaQA accuracy within ~1 point of the base model; LoRA and full finetuning produce larger task gains but exhibit clear drift on both forgetting probes. The work also contrasts the two row-selection heuristics.

Significance. If the sparsity mechanism is confirmed to be the source of the observed stability, SMF would constitute a practical, low-overhead alternative to parameter-efficient methods for task adaptation that preserves general capabilities. The direct head-to-head comparison on a concrete medical QA task with two standard forgetting probes supplies useful empirical data, and the provision of two distinct selection rules allows readers to see trade-offs in retention metrics.

major comments (2)
  1. [Experimental results (and abstract)] The central empirical claim attributes the low-forgetting outcome to the sparse row-update rule, yet the manuscript contains no dense-update control in which every memory row is updated on each step (or under a matched total gradient budget). Without this ablation it remains possible that the stability is produced simply by the insertion of the key-value memory layers rather than by the sparsity that the method name and abstract emphasize.
  2. [Abstract] Abstract and results sections report comparative deltas (2.5 pp MedMCQA gain, ~1-point drift on probes) but supply no implementation details, statistical tests, error bars, number of runs, or data-exclusion criteria, preventing assessment of whether the reported margins are robust.
minor comments (1)
  1. [Abstract] The description of how KL-divergence versus TF-IDF selection differentially affects the two forgetting metrics could be expanded with a short quantitative comparison or table.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which highlight important aspects for strengthening our empirical claims. We address the two major comments below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Experimental results (and abstract)] The central empirical claim attributes the low-forgetting outcome to the sparse row-update rule, yet the manuscript contains no dense-update control in which every memory row is updated on each step (or under a matched total gradient budget). Without this ablation it remains possible that the stability is produced simply by the insertion of the key-value memory layers rather than by the sparsity that the method name and abstract emphasize.

    Authors: We agree that including a dense-update control for the memory layers would provide stronger evidence that the sparsity mechanism is responsible for the observed stability rather than the mere addition of the key-value memory layers. Our current comparisons are with LoRA and full finetuning, which demonstrate that SMF achieves better retention, but they do not isolate the sparsity effect within the memory-augmented model. In the revised manuscript, we will add an ablation where all memory rows are updated on each step, with a matched total gradient budget or update frequency to ensure fair comparison. This will allow us to directly attribute the low-forgetting property to the sparse selection rule. revision: yes

  2. Referee: [Abstract] Abstract and results sections report comparative deltas (2.5 pp MedMCQA gain, ~1-point drift on probes) but supply no implementation details, statistical tests, error bars, number of runs, or data-exclusion criteria, preventing assessment of whether the reported margins are robust.

    Authors: We acknowledge the need for greater transparency in reporting experimental details to allow readers to evaluate the robustness of our results. The current manuscript provides the main performance deltas but omits specifics such as the number of independent runs, variance measures, and statistical tests. In the revision, we will update the abstract and results sections to include: the number of runs performed (with details on random seeds), error bars or standard deviations where applicable, results of statistical significance tests (e.g., t-tests comparing methods), and any criteria used for data exclusion or preprocessing. We will also add more implementation details on hyperparameters and the exact mechanics of the row selection heuristics. revision: yes
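The reporting the rebuttal promises can be sketched with standard-library tools: per-seed accuracies, mean with standard deviation, and a two-sample test statistic. The per-seed values below are invented placeholders for illustration only.

```python
# Hedged sketch of cross-seed reporting: summary statistics plus a Welch
# t-statistic between two methods. All accuracy values are invented.
import math
import statistics

def summarize(runs):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(runs), statistics.stdev(runs)

def welch_t(a, b):
    """Welch's t-statistic for an unequal-variance two-sample comparison."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

smf_runs = [38.1, 38.6, 37.9]    # hypothetical MedMCQA accuracy per seed
lora_runs = [40.2, 41.0, 40.5]   # hypothetical LoRA accuracy per seed
mean_smf, sd_smf = summarize(smf_runs)
t = welch_t(lora_runs, smf_runs)  # positive => LoRA's task gain is larger
```

With the seed count and variance reported, a reader can judge whether a 2.5 pp gain and a 1-point drift band are separable from run-to-run noise.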

Circularity Check

0 steps flagged

No circularity detected; purely empirical measurements with no derivation chain.

full rationale

The manuscript reports direct experimental results from re-implementing SMF on Qwen-2.5-0.5B-Instruct and measuring MedMCQA accuracy plus WikiText/TriviaQA forgetting probes against LoRA and full finetuning baselines. No equations, first-principles derivations, or predictions are claimed; the central claims are observed performance deltas. No self-citations form load-bearing premises, no parameters are fitted then relabeled as predictions, and no ansatz or uniqueness theorem is invoked. The method is defined operationally by the update rule and selection heuristics, with outcomes measured independently on held-out tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work is an empirical comparison with no first-principles derivation; the central claim rests on the untested assumption that sparse memory updates isolate task learning.

axioms (1)
  • domain assumption: Selectively updating only the most relevant memory rows suffices for task adaptation without affecting unrelated knowledge.
    This is the core premise of the SMF design as stated in the abstract.
invented entities (1)
  • Sparse key-value memory layers: no independent evidence
    purpose: To store and selectively update task-specific information with minimal impact on the base model.
    New architectural component introduced by the method.

pith-pipeline@v0.9.0 · 5487 in / 1252 out tokens · 33606 ms · 2026-05-08T17:47:39.561915+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

8 extracted references

  1. Continual Learning via Sparse Memory Finetuning. 2025.
  2. Large Memory Layers with Product Keys. Advances in Neural Information Processing Systems.
  3. Memory Layers at Scale. 2024.
  4. Hu, Edward J., Shen, Yelong, Wallis, Phillip, Allen-Zhu, Zeyuan, Li, Yuanzhi, Wang, Shean, Wang, Lu, and Chen, Weizhu. LoRA: Low-Rank Adaptation of Large Language Models.
  5. Pal, Ankit, Umapathi, Logesh Kumar, and Sankarasubbu, Malaikannan. MedMCQA: A Large-Scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering. 2022.
  6. Joshi, Mandar, Choi, Eunsol, Weld, Daniel S., and Zettlemoyer, Luke. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension.
  7. Pointer Sentinel Mixture Models. International Conference on Learning Representations.
  8. Advances in Neural Information Processing Systems.