Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
Pith reviewed 2026-05-16 04:50 UTC · model grok-4.3
The pith
Engram introduces conditional memory as a new sparsity axis that lets large language models perform direct O(1) knowledge lookups instead of simulating retrieval through computation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Conditional memory via Engram supplies an O(1) lookup primitive that complements mixture-of-experts computation. When sparsity is allocated according to the observed U-shaped law, scaling the memory module to 27 billion parameters produces higher scores than an iso-parameter, iso-FLOPs MoE baseline on reasoning tasks such as BBH and ARC-Challenge and on long-context retrieval such as Multi-Query NIAH. The module also enables deterministic prefetching with negligible runtime cost.
What carries the argument
Engram, a module that modernizes n-gram embeddings into a conditional memory table for single-step lookup.
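The review does not reproduce the paper's exact addressing scheme, so the following is only a minimal sketch of the primitive being described: a static embedding table addressed by a deterministic hash of the trailing token-ID n-gram, giving one O(1) lookup per position. The class name, multiplicative hash, padding convention, and table sizes are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of an Engram-style conditional-memory lookup (assumed design:
# hashed n-gram addressing into a static table; the paper's actual hashing,
# collision handling, and fusion with the backbone are not specified here).
import numpy as np

class HashedNGramMemory:
    def __init__(self, table_size: int, dim: int, n: int = 2, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(0.0, 0.02, size=(table_size, dim))  # static memory rows
        self.mults = rng.integers(1, 2**31 - 1, size=n)             # per-position hash multipliers
        self.table_size = table_size
        self.n = n

    def address(self, ngrams: np.ndarray) -> np.ndarray:
        # Deterministic O(1) addressing: multiplicative hash of each token-ID n-gram.
        return (ngrams * self.mults).sum(axis=-1) % self.table_size

    def lookup(self, token_ids: np.ndarray) -> np.ndarray:
        # Build the trailing n-gram for every position (left-padded with 0) and gather rows.
        padded = np.concatenate([np.zeros(self.n - 1, dtype=token_ids.dtype), token_ids])
        ngrams = np.stack([padded[i:i + len(token_ids)] for i in range(self.n)], axis=-1)
        return self.table[self.address(ngrams)]  # (seq_len, dim): one lookup per position

# Usage: memory vectors the backbone could add to its hidden states.
memory = HashedNGramMemory(table_size=1 << 16, dim=64, n=2)
vectors = memory.lookup(np.array([17, 4093, 88, 88, 7]))
print(vectors.shape)  # (5, 64)
```

Because the address depends only on token IDs rather than hidden states, the rows needed for an upcoming batch can be computed in advance, which is the property behind the deterministic-prefetching claim in the abstract.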
Load-bearing premise
The measured gains come from the memory lookups themselves rather than from differences in training procedure or other architectural details.
What would settle it
An ablation that disables Engram while keeping every other component identical and then records no drop in BBH or Multi-Query NIAH scores would falsify the claim that conditional memory drives the improvements.
Original abstract
While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic $N$-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains (HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone's early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Engram, a scalable N-gram lookup module implementing conditional memory as a new sparsity axis complementary to Mixture-of-Experts (MoE) computation in Transformers. By formulating the Sparsity Allocation problem, the authors identify a U-shaped scaling law governing the optimal trade-off between MoE layers and Engram memory capacity. Guided by this law, they scale an Engram-augmented model to 27B parameters and report superior performance versus a strictly iso-parameter and iso-FLOPs MoE baseline, with gains on reasoning (BBH +5.0, ARC-Challenge +3.7), knowledge (MMLU +3.4), code/math (HumanEval +3.0, MATH +2.4), and long-context retrieval (Multi-Query NIAH 84.2 to 97.0). Mechanistic analyses claim that Engram relieves early-layer static reconstruction and frees attention for global context, while its deterministic addressing enables efficient host-memory prefetching.
Significance. If the U-shaped law generalizes and the performance deltas are causally attributable to the memory module rather than experimental confounds, the work would introduce a practical new modeling primitive that augments conditional computation with conditional memory. The infrastructure-aware efficiency claim and the observation of larger gains in reasoning than pure retrieval are potentially high-impact for scaling sparse LLMs. The paper ships concrete scaling results at 27B, which strengthens the empirical case if the allocation procedure is shown to be non-circular.
major comments (3)
- [Sparsity Allocation problem] Sparsity Allocation section: The U-shaped scaling law is presented as uncovered from the Sparsity Allocation problem and then used to select the MoE/Engram split at 27B parameters. The manuscript supplies no derivation details, fitting procedure, or independent validation set, leaving open whether the reported 27B allocation was determined a priori or fitted to the same performance data used to claim superiority over the iso-FLOPs MoE baseline.
- [Experimental results] Experimental results and baselines: Concrete deltas are reported (BBH +5.0, Multi-Query NIAH 84.2 to 97.0) yet the text provides no information on statistical significance, variance across seeds, exact baseline MoE implementation details, or how iso-FLOPs equivalence was enforced when adding the Engram module. These omissions are load-bearing for the central claim that conditional memory produces the observed gains.
- [Mechanistic analyses] Mechanistic analyses: The statements that Engram 'relieves the backbone's early layers from static reconstruction' and 'frees up attention capacity for global context' are supported only by qualitative description. Quantitative evidence such as layer-wise activation norms, attention entropy statistics, or controlled ablations isolating the memory module from other architectural changes is required to establish causality.
minor comments (2)
- [Abstract] The abstract states that Engram 'modernizes classic N-gram embedding' but does not specify the exact modifications (hashing scheme, embedding dimension scaling, or collision handling) that enable O(1) lookup at 27B scale; a brief equation or pseudocode would improve clarity.
- [Sparsity Allocation problem] Notation for the sparsity allocation ratio is introduced without an explicit equation linking it to the U-shaped law; adding a numbered equation would make the scaling claim easier to follow.
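For concreteness, one hypothetical form such a numbered equation could take, assuming the U-shape is modeled as a quadratic in the log allocation ratio (the paper's actual functional form is not given in this review):

```latex
% Hypothetical sketch: \rho is the fraction of the sparse parameter budget P
% assigned to Engram memory, with (1-\rho)P left for MoE experts.
\begin{equation}
  \mathcal{L}(\rho) = L_0 + a\left(\log\rho - \log\rho^{\ast}\right)^{2},
  \qquad
  \rho^{\ast} = \arg\min_{\rho \in (0,1)} \mathcal{L}(\rho)
\end{equation}
% Validation loss rises when either memory (\rho \to 0) or computation
% (\rho \to 1) is starved, which is the simplest reading of a U-shaped law.
```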
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major point below and commit to revisions that strengthen the empirical and mechanistic claims without altering the core contributions.
Point-by-point responses
- Referee: [Sparsity Allocation problem] Sparsity Allocation section: The U-shaped scaling law is presented as uncovered from the Sparsity Allocation problem and then used to select the MoE/Engram split at 27B parameters. The manuscript supplies no derivation details, fitting procedure, or independent validation set, leaving open whether the reported 27B allocation was determined a priori or fitted to the same performance data used to claim superiority over the iso-FLOPs MoE baseline.
Authors: We agree the current text omits these details. In revision we will add the full mathematical formulation of the Sparsity Allocation objective, the exact fitting procedure (including the held-out validation set and hyperparameter search protocol), and explicit confirmation that the 27B MoE/Engram split was selected using only the validation data before any final test-set evaluation. This will demonstrate the allocation was determined a priori relative to the reported benchmark numbers. revision: yes
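A minimal sketch of such a procedure, assuming the U-shaped law is fitted as a quadratic in the log allocation ratio on small-scale held-out losses; the data points, functional form, and variable names are illustrative, not taken from the paper:

```python
# Sketch of an a-priori allocation choice: fit a U-shaped model to held-out
# validation losses measured at small scale, then read off the minimizing ratio
# before any final test-set evaluation. All numbers below are placeholders.
import numpy as np

# Hypothetical small-scale measurements: fraction of the sparse budget given to
# Engram vs. validation loss on a held-out split.
rho = np.array([0.02, 0.05, 0.10, 0.20, 0.40, 0.60])
loss = np.array([2.31, 2.27, 2.24, 2.23, 2.26, 2.33])

# Fit loss ~ a*(log rho)^2 + b*log rho + c and take the analytic minimum.
a, b, c = np.polyfit(np.log(rho), loss, deg=2)
rho_star = float(np.exp(-b / (2 * a)))
print(f"estimated optimal Engram fraction: {rho_star:.3f}")
```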
- Referee: [Experimental results] Experimental results and baselines: Concrete deltas are reported (BBH +5.0, Multi-Query NIAH 84.2 to 97.0) yet the text provides no information on statistical significance, variance across seeds, exact baseline MoE implementation details, or how iso-FLOPs equivalence was enforced when adding the Engram module. These omissions are load-bearing for the central claim that conditional memory produces the observed gains.
Authors: We acknowledge these omissions. The revised manuscript will report (i) p-values from paired statistical tests across the reported metrics, (ii) standard deviation over at least three independent random seeds for all main results, (iii) the precise MoE baseline architecture (number of experts, top-k, expert size) and training hyperparameters, and (iv) the exact FLOPs accounting that keeps total training and inference compute matched when Engram is added (by reducing MoE layer width proportionally). revision: yes
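A sketch of the promised seed-level reporting, assuming a paired t-test over matched training seeds; the scores below are placeholders, not results from the paper:

```python
# Mean, standard deviation, and a paired t-test between the Engram-augmented
# model and the iso-FLOPs MoE baseline over matched seeds (placeholder scores).
import numpy as np
from scipy import stats

engram_scores = np.array([62.1, 61.5, 62.8])    # hypothetical BBH, three seeds
baseline_scores = np.array([57.0, 56.9, 57.6])  # same seeds, MoE baseline

t_stat, p_value = stats.ttest_rel(engram_scores, baseline_scores)
print(f"Engram  : {engram_scores.mean():.2f} +/- {engram_scores.std(ddof=1):.2f}")
print(f"Baseline: {baseline_scores.mean():.2f} +/- {baseline_scores.std(ddof=1):.2f}")
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```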
- Referee: [Mechanistic analyses] Mechanistic analyses: The statements that Engram 'relieves the backbone's early layers from static reconstruction' and 'frees up attention capacity for global context' are supported only by qualitative description. Quantitative evidence such as layer-wise activation norms, attention entropy statistics, or controlled ablations isolating the memory module from other architectural changes is required to establish causality.
Authors: We agree that quantitative support is required. We will add (i) layer-wise L2 activation norm comparisons between the Engram-augmented model and the iso-FLOPs MoE baseline, (ii) attention entropy and head-wise focus statistics on long-context sequences, and (iii) a controlled ablation that inserts Engram while freezing all other architectural changes. These results will be presented in a new subsection of the mechanistic analysis. revision: yes
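A sketch of the two diagnostics, assuming hidden states and attention probabilities can be captured from both models; the tensor shapes, capture points, and function names are illustrative:

```python
# Layer-wise activation norms and per-head attention entropy (illustrative).
import torch

def layerwise_activation_norms(hidden_states):
    # hidden_states: list of (batch, seq, dim) tensors, one per layer.
    # A drop in early-layer norms after adding Engram would be consistent with
    # the claim that static reconstruction is offloaded to the memory table.
    return [h.norm(dim=-1).mean().item() for h in hidden_states]

def attention_entropy(attn_probs, eps=1e-9):
    # attn_probs: (batch, heads, query, key) rows summing to 1.
    # Higher entropy indicates diffuse, global attention; lower entropy
    # indicates sharp, local attention.
    ent = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # (batch, heads, query)
    return ent.mean(dim=(0, 2))                                 # mean entropy per head

# Usage with synthetic tensors standing in for captured model internals.
hiddens = [torch.randn(2, 16, 64) for _ in range(4)]
attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)
print(layerwise_activation_norms(hiddens))
print(attention_entropy(attn))
```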
Circularity Check
No significant circularity; derivation is empirical discovery followed by guided scaling
Full rationale
The paper formulates the Sparsity Allocation problem, empirically uncovers a U-shaped scaling law from experiments on smaller scales, and then applies the observed law to select allocation when scaling Engram to 27B parameters. This sequence is a standard empirical finding followed by extrapolation and does not reduce any claimed performance gain to a fitted parameter or self-citation by construction. No self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The central claims rest on reported benchmark deltas against an iso-parameter baseline, which remain independently falsifiable.
Axiom & Free-Parameter Ledger
free parameters (1)
- Sparsity allocation ratio
axioms (1)
- Domain assumption: Transformers lack a native primitive for knowledge lookup and must simulate retrieval through computation.
invented entities (1)
- Engram module (no independent evidence)
Forward citations
Cited by 20 Pith papers
- Geometric Factual Recall in Transformers
  A single-layer transformer memorizes random subject-attribute bijections using logarithmic embedding dimension via linear superpositions in embeddings and ReLU-gated selection in the MLP, with zero-shot transfer to ne...
- When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
  SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
- Does Engram Do Memory Retrieval in Autoregressive Image Generation?
  Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.
- NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining
  NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency on 1,536 workers via dual-buffer inter-batch and frozen-window intra-batch pipelining that overlaps communication with computation.
- Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory
  PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
- Conditional Memory Enhanced Item Representation for Generative Recommendation
  ComeIR introduces dual-level Engram memory and memory-restoring prediction to reconstruct SID-token embeddings and restore token granularity in generative recommendation.
- Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model
  Introduces the SMART-HC-VQA dataset with 65k single-image and 2.3M temporal VQA examples plus an adapted LLaVA-NeXT MLLM framework for geospatial-temporal sensemaking of remote sensing construction activity.
- Contextual Memory-Enhanced Source Coding for Low-SNR Communications
  MASC internalizes multi-order n-gram patterns via shared PCM and MMER routing to refine source probabilities, shorten codelengths, and reduce sensitivity to channel errors in SSCC for low-SNR regimes.
- Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
  PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
- Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
  Byte-level simulations show subword tokenization improves LLM training mainly via increased throughput and boundary priors.
- Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling
  X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scale...
- The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
  MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density c...
- Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
  Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-paramete...
- Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
  Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to...
- MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
  MSA is an end-to-end trainable memory model using sparse attention and document-wise RoPE that scales to 100M tokens with linear complexity and less than 9% degradation.
- Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
  PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.
- Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
  Subword tokenization's main benefits arise from higher sample throughput and the use of subword boundaries as explicit priors or inductive biases, isolated via controlled byte-level simulations.
- Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
  LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
- MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
  MemCoT redefines long-context reasoning as iterative stateful search with zoom-in/zoom-out memory perception and dual short-term memories, claiming SOTA results on LoCoMo and LongMemEval-S benchmarks.
- Decidable By Construction: Design-Time Verification for Trustworthy AI
  A type system over finitely generated abelian groups enables design-time verification of AI model properties and links Hindley-Milner unification to a restriction of Solomonoff's universal prior.