hub Mixed citations

Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie · 2026 · cs.CL · arXiv 2601.07372

Mixed citation behavior. Most common role is background (62%).

33 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 33 citing papers arXiv PDF

abstract

While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic $N$-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains~(HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone's early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 method 2

citation-polarity summary

background 5 use method 2 support 1

representative citing papers

Geometric Factual Recall in Transformers

cs.CL · 2026-05-12 · conditional · novelty 8.0

A single-layer transformer memorizes random subject-attribute bijections using logarithmic embedding dimension via linear superpositions in embeddings and ReLU-gated selection in the MLP, with zero-shot transfer to new facts and matching multi-hop constructions.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

LLMs Need Encoders for Semantic IDs Too

cs.IR · 2026-05-29 · unverdicted · novelty 7.0

PrefixMem encoder for Semantic IDs improves deepest-level accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative on Pinterest data across LLM families.

Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Tiny-Engram uses small n-gram-indexed memory tables to bind trigger phrases to target visual identities in diffusion models while preserving compositional control from the surrounding prompt.

Does Engram Do Memory Retrieval in Autoregressive Image Generation?

cs.CV · 2026-05-13 · accept · novelty 7.0

Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.

NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining

cs.DC · 2026-04-08 · unverdicted · novelty 7.0

NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency on 1,536 workers via dual-buffer inter-batch and frozen-window intra-batch pipelining that overlaps communication with computation.

Mechanism-Driven Monitors for Preemptive Detection of LLM Training Instability

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

Derives mechanism-based monitors from module functional roles and validates them via fault-injection experiments showing early detection of LLM training instability.

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

cs.CL · 2026-06-11 · unverdicted · novelty 6.0

RA-RFT trains a retriever to rank contexts by expected reasoning benefit and uses the retrieved analogies inside reinforcement fine-tuning, yielding 7.1 and 2.8 point gains on AIME 2025 over GRPO for two Qwen3 models.

Do Value Vectors in Deep Layers Need Context from the Residual Stream?

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

Deeper transformer layers benefit from context-free token-specific value vectors in a Bank of Values lookup table, improving performance over standard attention with less compute.

When Data Is Scarce: Scaling Sparse Language Models with Repeated Training

cs.LG · 2026-05-31 · unverdicted · novelty 6.0

Sparse LLMs in data-scarce multi-epoch regimes follow a scaling law based on active parameters, unique tokens, repetition count, and sparsity level that predicts performance and delays data saturation.

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

cs.CV · 2026-05-26 · unverdicted · novelty 6.0

Tensor Memory augments Transformers with a constant-size 3D voxel grid using differentiable soft writes at predicted locations, local interaction, and gated recurrent dynamics to decouple memory capacity from sequence length.

Simulating Human Memory with Language Models

cs.CL · 2026-05-25 · unverdicted · novelty 6.0

Language models show superior memory to humans on psych experiments but can be adjusted via prompting and compaction to forget more human-like, yielding better user simulators.

Lngram: N-gram Conditional Memory in Latent Space

cs.CL · 2026-05-24 · unverdicted · novelty 6.0

Lngram is a latent N-gram conditional memory module that learns discrete symbols from hidden states for N-gram lookup, outperforming baselines in language modeling and multimodal tasks.

Mem-$\pi$: Adaptive Memory through Learning When and What to Generate

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.

Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

Memory Grafting improves language-model benchmarks by grafting offline hidden-state memory from a larger model into a recipient model using n-gram lookups and lightweight adapters, outperforming MoE and vanilla Engram baselines at 0.92B and 2.8B scales.

HRM-Text: Efficient Pretraining Beyond Scaling

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

A 1B-parameter hierarchical recurrent model pretrained on 40B instruction-response tokens achieves 60.7% MMLU and strong results on ARC-C, DROP, GSM8K, and MATH while using 100-900x fewer tokens than standard baselines.

Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.

Conditional Memory Enhanced Item Representation for Generative Recommendation

cs.IR · 2026-05-12 · unverdicted · novelty 6.0

ComeIR introduces dual-level Engram memory and memory-restoring prediction to reconstruct SID-token embeddings and restore token granularity in generative recommendation.

Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model

eess.IV · 2026-05-11 · unverdicted · novelty 6.0

Introduces the SMART-HC-VQA dataset with 65k single-image and 2.3M temporal VQA examples plus an adapted LLaVA-NeXT MLLM framework for geospatial-temporal sensemaking of remote sensing construction activity.

Contextual Memory-Enhanced Source Coding for Low-SNR Communications

cs.IT · 2026-05-06 · unverdicted · novelty 6.0

MASC internalizes multi-order n-gram patterns via shared PCM and MMER routing to refine source probabilities, shorten codelengths, and reduce sensitivity to channel errors in SSCC for low-SNR regimes.

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

cs.CV · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

cs.CL · 2026-04-29 · unverdicted · novelty 6.0 · 2 refs

Byte-level simulations show subword tokenization improves LLM training mainly via increased throughput and boundary priors.

Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling

cs.CL · 2026-04-23 · unverdicted · novelty 6.0

X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scales using 50% smaller tables.

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.

citing papers explorer

Showing 0 of 0 citing papers after filters.

No citing papers match the current filters.

Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer