Lngram: N-gram Conditional Memory in Latent Space
Pith reviewed 2026-06-30 12:24 UTC · model grok-4.3
The pith
Lngram learns discrete symbols from hidden states to enable N-gram lookups in latent space, decoupling retrieval from tokenization and extending to multimodal data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lngram is a latent-space conditional memory module that learns discrete symbols directly from hidden states and performs N-gram lookup over these symbols. This design removes the dependence on tokenizer IDs and naturally extends to non-text modalities. In our evaluated settings, Lngram outperforms Transformer and Engram baselines, consistently reduces perplexity in long-context language modeling, and effectively injects domain knowledge when added post hoc to pretrained models. Joint training with the backbone further surpasses full fine-tuning, while experiments on vision-language and vision-language-action tasks show overall gains.
What carries the argument
Lngram module: a latent-space conditional memory that extracts discrete symbols from hidden states and performs N-gram lookup over those symbols.
If this is right
- Consistently reduces perplexity in long-context language modeling.
- Effectively injects domain knowledge when added post hoc to pretrained models.
- Joint training with the backbone surpasses full fine-tuning.
- Shows overall gains on vision-language and vision-language-action tasks.
- Enables prediction-relevant information to emerge earlier in the network layers.
Where Pith is reading between the lines
- The token-free design could be applied to non-text sequences such as audio waveforms or biological strings to test cross-modal generality.
- Earlier emergence of useful signals might allow shorter context windows or shallower stacks to reach the same accuracy.
- The learned discrete symbols could be inspected to see whether they correspond to interpretable concepts or patterns inside the model.
- Similar latent-symbol memory could be attached to non-Transformer backbones to check whether the benefit is architecture-specific.
Load-bearing premise
That discrete symbols learned directly from hidden states can support effective N-gram conditional memory without the information loss or dependence issues of token-based keys.
What would settle it
If inserting Lngram into a pretrained model produces no reduction in perplexity on long-context language modeling benchmarks or no gains on vision-language tasks relative to the Engram baseline, the performance advantage would be falsified.
Figures
read the original abstract
Sequence modeling requires both compositional reasoning and local static knowledge retrieval, yet standard Transformers handle both through dense computation. Engram partially decouples retrieval from the backbone, but its token-based keys remain tied to text tokenization and hash compression. We propose Lngram, a latent-space conditional memory module that learns discrete symbols directly from hidden states and performs N-gram lookup over these symbols. This design removes the dependence on tokenizer IDs and naturally extends to non-text modalities. In our evaluated settings, Lngram outperforms Transformer and Engram baselines, consistently reduces perplexity in long-context language modeling, and effectively injects domain knowledge when added post hoc to pretrained models. Joint training with the backbone further surpasses full fine-tuning, while experiments on vision-language and vision-language-action tasks show overall gains. Analyses with LogitLens and CKA suggest that Lngram enables prediction-relevant information to emerge earlier, increasing effective depth with limited inference and memory overhead. Code is available at https://github.com/zyaaa-ux/Lngram.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Lngram, a latent-space conditional memory module that learns discrete symbols directly from hidden states and performs N-gram lookup over them, decoupling retrieval from token IDs and extending to non-text modalities. It claims that in evaluated settings Lngram outperforms Transformer and Engram baselines, reduces perplexity in long-context language modeling, injects domain knowledge post hoc into pretrained models, surpasses full fine-tuning when jointly trained, yields gains on vision-language and vision-language-action tasks, and enables earlier emergence of prediction-relevant information per LogitLens and CKA analyses, all with limited inference and memory overhead.
Significance. If the empirical results hold with proper controls, the work offers a practical route to add static N-gram-style knowledge retrieval to dense models without tokenizer dependence or full retraining, potentially increasing effective depth at modest cost and broadening applicability beyond text.
minor comments (2)
- The abstract states performance gains and architectural advantages but supplies no dataset names, model sizes, training details, error bars, or ablation controls; these must be added to the experimental section for the claims to be verifiable.
- The description of the discretization step and N-gram lookup mechanism would benefit from an explicit equation or pseudocode block showing how hidden-state symbols are obtained and indexed.
Simulated Author's Rebuttal
We thank the referee for their summary of the paper and for noting the potential significance of adding static N-gram-style knowledge retrieval to dense models without tokenizer dependence. We appreciate the 'uncertain' recommendation and welcome the opportunity to clarify any aspects of the empirical results.
Circularity Check
No significant circularity
full rationale
The paper presents Lngram as an empirical architecture proposal: a latent-space discretization module for N-gram lookup that is evaluated on language modeling, vision-language, and related tasks. No equations, derivations, or first-principles claims appear in the provided abstract or description. Performance assertions are framed as experimental outcomes rather than reductions from fitted parameters or self-citations. The work is self-contained against external benchmarks (Transformer and Engram baselines) with no load-bearing self-referential steps visible.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Lessons from the Trenches on Reproducible Evaluation of Language Models
Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., Van Der Wal, O., et al. Lessons from the trenches on reproducible evaluation of language models.arXiv preprint arXiv:2405.14782,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Accessed: 2026-04-
GitHub repository directory. Accessed: 2026-04-
2026
-
[3]
C., Xu, P., Och, F
Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. Large language models in machine translation. InProceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natu- ral Language Learning, pp. 858–867,
2007
-
[4]
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
Cheng, X., Zeng, W., Dai, D., Chen, Q., Wang, B., Xie, Z., Huang, K., Yu, X., Hao, Z., Li, Y ., Zhang, H., Zhang, H., Zhao, D., and Liang, W. Conditional memory via scalable lookup: A new axis of sparsity for large language models. arXiv preprint arXiv:2601.07372,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y ., Xie, Z., Li, Y . K., Huang, P., Luo, F., Ruan, C., Sui, Z., and Liang, W. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models.arXiv preprint arXiv:2401.06066,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-AI, Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Deng, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Ji, D., Li, E., Lin, F., Luo, F., Hao, G., Chen, G., Li, G., et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
BERT: Pre-training of deep bidirectional transformers for lan- guage understanding
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for lan- guage understanding. InProceedings of the 2019 Confer- ence of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, pp. 4171–4186,
2019
-
[8]
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
Liu, B., Zhu, Y ., Gao, C., Feng, Y ., Liu, Q., Zhu, Y ., and Stone, P. LIBERO: Benchmarking knowledge transfer for lifelong robot learning.arXiv preprint arXiv:2306.03310, 2023a. Liu, H., Li, C., Wu, Q., and Lee, Y . J. Visual instruction tuning. InAdvances in Neural Information Processing Systems, 2023b. nostalgebraist. Interpreting GPT: The logit lens....
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Accessed: 2026-04-21. Penedo, G., Kydl ´ıˇcek, H., Ben Allal, L., Lozhkov, A., Mitchell, M., Raffel, C., von Werra, L., and Wolf, T. The FineWeb datasets: Decanting the web for the finest text data at scale.arXiv preprint arXiv:2406.17557,
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[10]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Physical Intelligence, Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M. Y ., Ghosh, D., Groom, L., Haus- man, K., Ichter, B., et al. π0.5: A vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Qwen, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing
StarVLA Community. StarVLA: A lego-like codebase for vision-language-action model developing.arXiv preprint arXiv:2604.05014,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y ., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine- tuned chat models.arXiv preprint arXiv:2307.09288,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y ., Mustafa, B., H ´enaff, O., Harmsen, J., Steiner, A., and Zhai, X. SigLIP 2: Multilingual vision- language encoders with improved semantic understand- ing, localization, and dense features.arXiv preprint arXiv:2502.14786,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
A study of situational reasoning for traffic understanding.arXiv preprint arXiv:2306.02520,
Zhang, J., Ilievski, F., Ma, K., Kollaa, A., Francis, J., and Oltramari, A. A study of situational reasoning for traffic understanding.arXiv preprint arXiv:2306.02520,
-
[17]
With windowed attention alone, the model’s average score drops from 59.21 under global attention to 29.41, indicating that windowed attention cannot cover the long-range information required by 17 Lngram: N-gram Conditional Memory in Latent Space Table 12.Results of the binary-state retrieval model on general language understanding tasks. Model HellaSwag ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.