Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
Pith reviewed 2026-05-16 04:50 UTC · model grok-4.3
The pith
Engram introduces conditional memory as a new sparsity axis that lets large language models perform direct O(1) knowledge lookups instead of simulating retrieval through computation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Conditional memory via Engram supplies an O(1) lookup primitive that complements mixture-of-experts computation. When sparsity is allocated according to the observed U-shaped law, scaling the memory module to 27 billion parameters produces higher scores than an iso-parameter, iso-FLOPs MoE baseline on reasoning tasks such as BBH and ARC-Challenge and on long-context retrieval such as Multi-Query NIAH. The module also enables deterministic prefetching with negligible runtime cost.
What carries the argument
Engram, a module that modernizes n-gram embeddings into a conditional memory table for single-step lookup.
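The review does not reproduce the paper's exact addressing scheme, so the following is only a minimal sketch of the primitive being described: a static embedding table addressed by a deterministic hash of the trailing token-ID n-gram, giving one O(1) lookup per position. The class name, multiplicative hash, padding convention, and table sizes are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of an Engram-style conditional-memory lookup (assumed design:
# hashed n-gram addressing into a static table; the paper's actual hashing,
# collision handling, and fusion with the backbone are not specified here).
import numpy as np

class HashedNGramMemory:
    def __init__(self, table_size: int, dim: int, n: int = 2, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(0.0, 0.02, size=(table_size, dim))  # static memory rows
        self.mults = rng.integers(1, 2**31 - 1, size=n)             # per-position hash multipliers
        self.table_size = table_size
        self.n = n

    def address(self, ngrams: np.ndarray) -> np.ndarray:
        # Deterministic O(1) addressing: multiplicative hash of each token-ID n-gram.
        return (ngrams * self.mults).sum(axis=-1) % self.table_size

    def lookup(self, token_ids: np.ndarray) -> np.ndarray:
        # Build the trailing n-gram for every position (left-padded with 0) and gather rows.
        padded = np.concatenate([np.zeros(self.n - 1, dtype=token_ids.dtype), token_ids])
        ngrams = np.stack([padded[i:i + len(token_ids)] for i in range(self.n)], axis=-1)
        return self.table[self.address(ngrams)]  # (seq_len, dim): one lookup per position

# Usage: memory vectors the backbone could add to its hidden states.
memory = HashedNGramMemory(table_size=1 << 16, dim=64, n=2)
vectors = memory.lookup(np.array([17, 4093, 88, 88, 7]))
print(vectors.shape)  # (5, 64)
```

Because the address depends only on token IDs rather than hidden states, the rows needed for an upcoming batch can be computed in advance, which is the property behind the deterministic-prefetching claim in the abstract.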
Load-bearing premise
The measured gains come from the memory lookups themselves rather than from differences in training procedure or other architectural details.
What would settle it
An ablation that disables Engram while keeping every other component identical and then records no drop in BBH or Multi-Query NIAH scores would falsify the claim that conditional memory drives the improvements.
Original abstract
While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic $N$-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains (HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone's early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Engram, a scalable N-gram lookup module implementing conditional memory as a new sparsity axis complementary to Mixture-of-Experts (MoE) computation in Transformers. By formulating the Sparsity Allocation problem, the authors identify a U-shaped scaling law governing the optimal trade-off between MoE layers and Engram memory capacity. Guided by this law, they scale an Engram-augmented model to 27B parameters and report superior performance versus a strictly iso-parameter and iso-FLOPs MoE baseline, with gains on reasoning (BBH +5.0, ARC-Challenge +3.7), knowledge (MMLU +3.4), code/math (HumanEval +3.0, MATH +2.4), and long-context retrieval (Multi-Query NIAH 84.2 to 97.0). Mechanistic analyses claim that Engram relieves early-layer static reconstruction and frees attention for global context, while its deterministic addressing enables efficient host-memory prefetching.
Significance. If the U-shaped law generalizes and the performance deltas are causally attributable to the memory module rather than experimental confounds, the work would introduce a practical new modeling primitive that augments conditional computation with conditional memory. The infrastructure-aware efficiency claim and the observation of larger gains in reasoning than pure retrieval are potentially high-impact for scaling sparse LLMs. The paper ships concrete scaling results at 27B, which strengthens the empirical case if the allocation procedure is shown to be non-circular.
major comments (3)
- [Sparsity Allocation problem] Sparsity Allocation section: The U-shaped scaling law is presented as uncovered from the Sparsity Allocation problem and then used to select the MoE/Engram split at 27B parameters. The manuscript supplies no derivation details, fitting procedure, or independent validation set, leaving open whether the reported 27B allocation was determined a priori or fitted to the same performance data used to claim superiority over the iso-FLOPs MoE baseline.
- [Experimental results] Experimental results and baselines: Concrete deltas are reported (BBH +5.0, Multi-Query NIAH 84.2 to 97.0) yet the text provides no information on statistical significance, variance across seeds, exact baseline MoE implementation details, or how iso-FLOPs equivalence was enforced when adding the Engram module. These omissions are load-bearing for the central claim that conditional memory produces the observed gains.
- [Mechanistic analyses] Mechanistic analyses: The statements that Engram 'relieves the backbone's early layers from static reconstruction' and 'frees up attention capacity for global context' are supported only by qualitative description. Quantitative evidence such as layer-wise activation norms, attention entropy statistics, or controlled ablations isolating the memory module from other architectural changes is required to establish causality.
minor comments (2)
- [Abstract] The abstract states that Engram 'modernizes classic N-gram embedding' but does not specify the exact modifications (hashing scheme, embedding dimension scaling, or collision handling) that enable O(1) lookup at 27B scale; a brief equation or pseudocode would improve clarity.
- [Sparsity Allocation problem] Notation for the sparsity allocation ratio is introduced without an explicit equation linking it to the U-shaped law; adding a numbered equation would make the scaling claim easier to follow.
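For concreteness, one hypothetical form such a numbered equation could take, assuming the U-shape is modeled as a quadratic in the log allocation ratio (the paper's actual functional form is not given in this review):

```latex
% Hypothetical sketch: \rho is the fraction of the sparse parameter budget P
% assigned to Engram memory, with (1-\rho)P left for MoE experts.
\begin{equation}
  \mathcal{L}(\rho) = L_0 + a\left(\log\rho - \log\rho^{\ast}\right)^{2},
  \qquad
  \rho^{\ast} = \arg\min_{\rho \in (0,1)} \mathcal{L}(\rho)
\end{equation}
% Validation loss rises when either memory (\rho \to 0) or computation
% (\rho \to 1) is starved, which is the simplest reading of a U-shaped law.
```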
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major point below and commit to revisions that strengthen the empirical and mechanistic claims without altering the core contributions.
Point-by-point responses
- Referee: [Sparsity Allocation problem] Sparsity Allocation section: The U-shaped scaling law is presented as uncovered from the Sparsity Allocation problem and then used to select the MoE/Engram split at 27B parameters. The manuscript supplies no derivation details, fitting procedure, or independent validation set, leaving open whether the reported 27B allocation was determined a priori or fitted to the same performance data used to claim superiority over the iso-FLOPs MoE baseline.
Authors: We agree the current text omits these details. In revision we will add the full mathematical formulation of the Sparsity Allocation objective, the exact fitting procedure (including the held-out validation set and hyperparameter search protocol), and explicit confirmation that the 27B MoE/Engram split was selected using only the validation data before any final test-set evaluation. This will demonstrate the allocation was determined a priori relative to the reported benchmark numbers. revision: yes
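A minimal sketch of such a procedure, assuming the U-shaped law is fitted as a quadratic in the log allocation ratio on small-scale held-out losses; the data points, functional form, and variable names are illustrative, not taken from the paper:

```python
# Sketch of an a-priori allocation choice: fit a U-shaped model to held-out
# validation losses measured at small scale, then read off the minimizing ratio
# before any final test-set evaluation. All numbers below are placeholders.
import numpy as np

# Hypothetical small-scale measurements: fraction of the sparse budget given to
# Engram vs. validation loss on a held-out split.
rho = np.array([0.02, 0.05, 0.10, 0.20, 0.40, 0.60])
loss = np.array([2.31, 2.27, 2.24, 2.23, 2.26, 2.33])

# Fit loss ~ a*(log rho)^2 + b*log rho + c and take the analytic minimum.
a, b, c = np.polyfit(np.log(rho), loss, deg=2)
rho_star = float(np.exp(-b / (2 * a)))
print(f"estimated optimal Engram fraction: {rho_star:.3f}")
```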
- Referee: [Experimental results] Experimental results and baselines: Concrete deltas are reported (BBH +5.0, Multi-Query NIAH 84.2 to 97.0) yet the text provides no information on statistical significance, variance across seeds, exact baseline MoE implementation details, or how iso-FLOPs equivalence was enforced when adding the Engram module. These omissions are load-bearing for the central claim that conditional memory produces the observed gains.
Authors: We acknowledge these omissions. The revised manuscript will report (i) p-values from paired statistical tests across the reported metrics, (ii) standard deviation over at least three independent random seeds for all main results, (iii) the precise MoE baseline architecture (number of experts, top-k, expert size) and training hyperparameters, and (iv) the exact FLOPs accounting that keeps total training and inference compute matched when Engram is added (by reducing MoE layer width proportionally). revision: yes
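A sketch of the promised seed-level reporting, assuming a paired t-test over matched training seeds; the scores below are placeholders, not results from the paper:

```python
# Mean, standard deviation, and a paired t-test between the Engram-augmented
# model and the iso-FLOPs MoE baseline over matched seeds (placeholder scores).
import numpy as np
from scipy import stats

engram_scores = np.array([62.1, 61.5, 62.8])    # hypothetical BBH, three seeds
baseline_scores = np.array([57.0, 56.9, 57.6])  # same seeds, MoE baseline

t_stat, p_value = stats.ttest_rel(engram_scores, baseline_scores)
print(f"Engram  : {engram_scores.mean():.2f} +/- {engram_scores.std(ddof=1):.2f}")
print(f"Baseline: {baseline_scores.mean():.2f} +/- {baseline_scores.std(ddof=1):.2f}")
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```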
- Referee: [Mechanistic analyses] Mechanistic analyses: The statements that Engram 'relieves the backbone's early layers from static reconstruction' and 'frees up attention capacity for global context' are supported only by qualitative description. Quantitative evidence such as layer-wise activation norms, attention entropy statistics, or controlled ablations isolating the memory module from other architectural changes is required to establish causality.
Authors: We agree that quantitative support is required. We will add (i) layer-wise L2 activation norm comparisons between the Engram-augmented model and the iso-FLOPs MoE baseline, (ii) attention entropy and head-wise focus statistics on long-context sequences, and (iii) a controlled ablation that inserts Engram while freezing all other architectural changes. These results will be presented in a new subsection of the mechanistic analysis. revision: yes
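A sketch of the two diagnostics, assuming hidden states and attention probabilities can be captured from both models; the tensor shapes, capture points, and function names are illustrative:

```python
# Layer-wise activation norms and per-head attention entropy (illustrative).
import torch

def layerwise_activation_norms(hidden_states):
    # hidden_states: list of (batch, seq, dim) tensors, one per layer.
    # A drop in early-layer norms after adding Engram would be consistent with
    # the claim that static reconstruction is offloaded to the memory table.
    return [h.norm(dim=-1).mean().item() for h in hidden_states]

def attention_entropy(attn_probs, eps=1e-9):
    # attn_probs: (batch, heads, query, key) rows summing to 1.
    # Higher entropy indicates diffuse, global attention; lower entropy
    # indicates sharp, local attention.
    ent = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # (batch, heads, query)
    return ent.mean(dim=(0, 2))                                 # mean entropy per head

# Usage with synthetic tensors standing in for captured model internals.
hiddens = [torch.randn(2, 16, 64) for _ in range(4)]
attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)
print(layerwise_activation_norms(hiddens))
print(attention_entropy(attn))
```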
Circularity Check
No significant circularity; derivation is empirical discovery followed by guided scaling
Full rationale
The paper formulates the Sparsity Allocation problem, empirically uncovers a U-shaped scaling law from experiments on smaller scales, and then applies the observed law to select allocation when scaling Engram to 27B parameters. This sequence is a standard empirical finding followed by extrapolation and does not reduce any claimed performance gain to a fitted parameter or self-citation by construction. No self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The central claims rest on reported benchmark deltas against an iso-parameter baseline, which remain independently falsifiable.
Axiom & Free-Parameter Ledger
free parameters (1)
- Sparsity allocation ratio
axioms (1)
- Domain assumption: Transformers lack a native primitive for knowledge lookup and must simulate retrieval through computation.
invented entities (1)
- Engram module (no independent evidence)
Forward citations
Cited by 20 Pith papers
- Geometric Factual Recall in Transformers
  A single-layer transformer memorizes random subject-attribute bijections using logarithmic embedding dimension via linear superpositions in embeddings and ReLU-gated selection in the MLP, with zero-shot transfer to ne...
- When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
  SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
- Does Engram Do Memory Retrieval in Autoregressive Image Generation?
  Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.
- NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining
  NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency on 1,536 workers via dual-buffer inter-batch and frozen-window intra-batch pipelining that overlaps communication with computation.
- Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory
  PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
- Conditional Memory Enhanced Item Representation for Generative Recommendation
  ComeIR introduces dual-level Engram memory and memory-restoring prediction to reconstruct SID-token embeddings and restore token granularity in generative recommendation.
- Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model
  Introduces the SMART-HC-VQA dataset with 65k single-image and 2.3M temporal VQA examples plus an adapted LLaVA-NeXT MLLM framework for geospatial-temporal sensemaking of remote sensing construction activity.
- Contextual Memory-Enhanced Source Coding for Low-SNR Communications
  MASC internalizes multi-order n-gram patterns via shared PCM and MMER routing to refine source probabilities, shorten codelengths, and reduce sensitivity to channel errors in SSCC for low-SNR regimes.
- Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
  PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
- Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
  Byte-level simulations show subword tokenization improves LLM training mainly via increased throughput and boundary priors.
- Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling
  X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scale...
- The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
  MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density c...
- Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
  Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-paramete...
- Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
  Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to...
- MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
  MSA is an end-to-end trainable memory model using sparse attention and document-wise RoPE that scales to 100M tokens with linear complexity and less than 9% degradation.
- Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
  PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.
- Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
  Subword tokenization's main benefits arise from higher sample throughput and the use of subword boundaries as explicit priors or inductive biases, isolated via controlled byte-level simulations.
- Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
  LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
- MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
  MemCoT redefines long-context reasoning as iterative stateful search with zoom-in/zoom-out memory perception and dual short-term memories, claiming SOTA results on LoCoMo and LongMemEval-S benchmarks.
- Decidable By Construction: Design-Time Verification for Trustworthy AI
  A type system over finitely generated abelian groups enables design-time verification of AI model properties and links Hindley-Milner unification to a restriction of Solomonoff's universal prior.