
arxiv: 2606.08347 · v1 · pith:WAU53BJ3new · submitted 2026-06-06 · 💻 cs.CL · cs.LG

Tensorizing Engram: Sharing Latents Across N-Gram Embeddings is Beneficial in LLMs

Pith reviewed 2026-06-27 19:26 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords n-gram embeddings · tensor decomposition · Canonical Polyadic · language models · parameter efficiency · multi-token patterns · shared latents

The pith

Tensorizing n-gram embeddings with shared CP factors lets LLMs match Engram performance using far fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern language models learn multi-token patterns implicitly across layers because they use only token-level embeddings. Earlier approaches such as Engram add explicit n-gram memories but keep a separate hash table for each order, which causes collisions and prevents nested n-grams from sharing latents. TN-gram instead decomposes the n-gram embedding tensor in Canonical Polyadic (CP) form, sharing token-position factors across orders and using order-absorption vectors to differentiate them. The result is a compact module that, in the paper's experiments, matches or beats Engram at a much lower parameter count. If the result holds, it points to a scalable way to inject phrase-level memory into transformers without the usual memory penalty.
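To make the collision problem concrete, here is a minimal sketch of a single per-order hashed n-gram table. The hash function, the table size, and the names `hash_ngram`, `VOCAB`, and `NUM_SLOTS` are illustrative assumptions, not Engram's actual implementation; the only point is that once the number of distinct n-grams exceeds the number of slots, some n-grams must share a slot.

```python
import itertools

VOCAB = 64          # hypothetical toy vocabulary size
NUM_SLOTS = 1024    # hypothetical number of slots in one per-order table

def hash_ngram(ngram, num_slots=NUM_SLOTS):
    """Map a tuple of token ids to a slot with a simple polynomial hash (an assumption)."""
    h = 0
    for tok in ngram:
        h = (h * 1_000_003 + tok + 1) % (2**61 - 1)
    return h % num_slots

def count_collisions(order):
    """Count n-grams of the given order that land in an already occupied slot."""
    seen = set()
    collisions = 0
    for ngram in itertools.product(range(VOCAB), repeat=order):
        slot = hash_ngram(ngram)
        if slot in seen:
            collisions += 1
        seen.add(slot)
    return collisions

# 64**2 = 4096 distinct bigrams compete for 1024 slots.
bigram_collisions = count_collisions(2)
print(bigram_collisions)
```

By the pigeonhole principle, at least 4096 - 1024 = 3072 of the toy bigrams collide regardless of which hash is used. A separate table per order also means a bigram and the trigram that extends it are stored in unrelated slots.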

Core claim

TN-gram represents tensorized n-gram embeddings through shared factors in the Canonical Polyadic (CP) form. It learns shared token-position factors together with order-absorption vectors to encode the embeddings of different n-gram orders. Comprehensive experiments demonstrate that TN-gram matches or even outperforms Engram-style n-gram modules while requiring far fewer parameters.

What carries the argument

Canonical Polyadic decomposition of the n-gram embedding tensor with shared token-position factors and order-absorption vectors that allow different orders to be encoded efficiently.
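The mechanism can be sketched in a few lines of NumPy. This is a guess at the general shape of a CP-form lookup, not the paper's implementation: the elementwise-product combination, the rule that an order-n context uses the first N - n absorption vectors, the projection `P`, and all sizes are assumptions. The RMS normalization and learnable per-order scales mentioned in the Figure 4 caption are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, R, D = 50, 4, 8, 16   # toy vocab size, max order, CP rank, output dim (all hypothetical)

# Shared token-position factors: one (V, R) matrix per position, reused by every order.
A = [rng.standard_normal((V, R)) for _ in range(N)]
# Order-absorption vectors: N - 2 vectors of length R (count taken from the Figure 1 caption).
W = rng.standard_normal((N - 2, R))
# Hypothetical projection from the rank-R latent to the model dimension.
P = rng.standard_normal((R, D)) / np.sqrt(R)

def ngram_embedding(tokens):
    """Embed an order-n context (2 <= n <= N) from shared CP factors (assumed form)."""
    n = len(tokens)
    assert 2 <= n <= N
    latent = np.ones(R)
    # Multiply in the shared factor row for each token position that is present.
    for k, tok in enumerate(tokens):
        latent = latent * A[k][tok]
    # Absent positions are stood in for by absorption vectors (assumed indexing).
    for w in W[: N - n]:
        latent = latent * w
    return latent @ P

e2 = ngram_embedding([3, 7])        # bigram
e3 = ngram_embedding([3, 7, 11])    # trigram extending the same bigram
print(e2.shape, e3.shape)
```

In this sketch the bigram and its trigram extension reuse the same rows `A[0][3]` and `A[1][7]`, which is the kind of cross-order sharing that separate hash tables cannot provide.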

If this is right

  • Multi-token patterns can be stored explicitly with reduced parameter overhead.
  • Nested n-grams benefit from shared latent structures rather than independent tables.
  • Hash collisions associated with per-order tables are sidestepped.
  • Overall model capacity for language tasks is preserved or improved at lower memory cost.
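A rough parameter-count sketch shows why both designs grow about linearly with the maximum order N, as the Figure 2 caption reports. The table layout is inferred from the Figure 1 caption; `HEADS`, `SLOTS`, and `HEAD_DIM` are placeholders that do not come from the paper, and `VOCAB` and `RANK` are borrowed from the Figure 4 and Figure 2 captions, which may describe different experiments. Any output projection is ignored.

```python
def hash_table_params(max_order, heads, slots, head_dim):
    """Per-order, per-head tables for orders 2..max_order (assumed layout)."""
    num_orders = max_order - 1
    return num_orders * heads * slots * head_dim

def cp_params(max_order, vocab, rank):
    """One (vocab x rank) factor per position plus max_order - 2 absorption vectors."""
    return max_order * vocab * rank + (max_order - 2) * rank

VOCAB, RANK = 8192, 1200                   # quoted in the figure captions
HEADS, SLOTS, HEAD_DIM = 8, 100_000, 64    # placeholders only, not from the paper

for n in range(3, 7):
    # Extra parameters needed to go from max order n to n + 1.
    cp_step = cp_params(n + 1, VOCAB, RANK) - cp_params(n, VOCAB, RANK)
    hash_step = (hash_table_params(n + 1, HEADS, SLOTS, HEAD_DIM)
                 - hash_table_params(n, HEADS, SLOTS, HEAD_DIM))
    print(n, cp_step, hash_step)
```

Each additional order costs a constant number of parameters in both cases. The relative size of the two steps depends entirely on the placeholder values, so the printout says nothing about the paper's measured savings.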

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar sharing mechanisms could apply to other tensor-structured linguistic features such as dependency trees.
  • Higher-order n-grams become feasible without exponential growth in storage.
  • The method may integrate with existing transformer architectures to improve handling of idiomatic expressions.
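The storage point about higher-order n-grams can be made concrete with a short worked comparison. The symbols are assumptions for illustration: vocabulary size V, order n, embedding width d, and CP rank R, with V = 8192 and R = 1200 borrowed from the figure captions.

```latex
% Assumed setup: vocabulary size V, n-gram order n, embedding width d, CP rank R.
\begin{align*}
\text{dense order-}n\text{ table:}\quad & V^{n} \text{ rows of width } d, \\
\text{CP form with shared factors:}\quad & n\,V\,R \text{ entries, plus a few length-}R\text{ vectors.}
\end{align*}
% With V = 8192 = 2^{13}, n = 5, R = 1200:
\[
V^{n} = 2^{65} \approx 3.7 \times 10^{19},
\qquad
n\,V\,R = 5 \cdot 8192 \cdot 1200 = 49{,}152{,}000 .
\]
```

The dense count grows exponentially in n while the factor count grows linearly, which is why a factorized or hashed representation is needed at all for 5-grams.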

Load-bearing premise

The Canonical Polyadic decomposition with shared token-position factors and order-absorption vectors can represent the necessary distinctions among nested n-grams without the collisions or capacity loss that separate hash tables were introduced to avoid.

What would settle it

Training and evaluating TN-gram versus a hash-based n-gram module on a corpus rich in overlapping n-grams and measuring whether TN-gram shows higher perplexity or lower accuracy on predictions involving those overlaps.

Figures

Figures reproduced from arXiv: 2606.08347 by Danilo Mandic, Giorgos Iacovides, Qibin Zhao, Wuyang Zhou, Yuning Qiu, Yuxuan Gu.

Figure 1: Architecture of Engram (Cheng et al., 2026) and the proposed TN-gram. Both methods inject explicit N-gram memory into selected Transformer blocks. Engram uses separate per-order, per-head hash tables $E_{n,k}$, while TN-gram replaces them with a tensorized CP memory module over shared factors, $A_1, \dots, A_N$, and order-absorption vectors, $w_1, \dots, w_{N-2}$, enabling cross-order latent sharing within different …
Figure 2: Left: Validation loss of the proposed TN-gram improves as N-gram order increases, with most of the gain achieved by 5-gram and little improvement thereafter. Right: Parameter count of both methods grows roughly linearly with N, with TN-gram using fewer parameters than the Engram baseline for the same N, with a CP tensor rank of 1200. To obtain the embedding for a specific order-$n$ context $(x_1, \dots, x_n)$, …
Figure 3 shows a clear difference between two scaling directions. Increasing the number of Engram hashing heads yields modest improvements in validation loss, suggesting that duplicating independent hash memories has diminishing returns. This is consistent with Engram's design, in which additional heads increase the number of retrieved slots, but do not leverage the shared latent structure across nested n-gram co…
Figure 4: Ablation Study. Left: Loss reduction relative to raw GPT for Engram and TN-gram for the vocabulary size 8192 with all 5-grams. Right: Validation loss of N-gram order for TN-gram ablations, showing effects of removing RMS normalization, learnable scale $\{s_k\}_{k=2}^{N}$, and order-absorption vectors $\{w_k\}_{k=1}^{N-n}$. …ing training loss by 4.27% and validation loss by 4.44%, compared to 3.99% and 4.17% for Engram. This…
Figure 5: Final-step performance of 18-layer transformer models across n-gram orders N. Left: TN-gram yields lower validation loss than Engram and the raw GPT baseline. The tensor rank in TN-gram is configured such that it always has fewer parameters than Engram. Right: TN-gram achieves higher CORE score, with strongest gain at larger N. C. Scaling n-gram Order in a 19-Layer Transformer To evaluate the scaling perfo…
Original abstract

Modern language models represent text using discrete token-level embeddings, which forces recurring multi-token patterns to be learned implicitly across Transformer layers. Both Over-tokenized Transformers and Engram attempt to address this limitation by explicitly incorporating multi-token (n-gram) memories. However, they rely on separate hash tables for each n-gram order, which introduces hash collisions and prevents nested n-grams from sharing the underlying latent structures. To address these issues, we propose Tensorized Engram (TN-gram), a compact memory module that represents tensorized n-gram embeddings through shared factors in the Canonical Polyadic (CP) form. TN-gram learns shared token-position factors together with order-absorption vectors to encode the embeddings of different n-gram orders. Comprehensive experiments demonstrate that TN-gram matches or even outperforms Engram-style n-gram modules while requiring far fewer parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Tensorized Engram (TN-gram) as a compact n-gram memory module for LLMs. It replaces per-order hash tables (as in Engram) with a Canonical Polyadic (CP) tensor decomposition that shares token-position factors across n-gram orders while using per-order absorption vectors to encode order-specific information. The central claim is that this sharing reduces parameters, avoids hash collisions, and enables latent sharing for nested n-grams, with comprehensive experiments showing that TN-gram matches or outperforms Engram-style modules.

Significance. If the empirical results hold and the representational capacity concern is addressed, the work would demonstrate a parameter-efficient tensor-based alternative for explicit multi-token memory in transformers. The approach directly targets the collision and non-sharing issues of prior hash-table methods while preserving the benefit of order-specific n-gram embeddings.

major comments (2)
  1. [Method (CP decomposition and absorption mechanism)] The load-bearing assumption that the CP factorization (shared token-position factors + order-absorption vectors) can recover the order-specific distinctions among nested n-grams without rank collapse or crosstalk is not automatically guaranteed by the CP form. If absorption vectors are limited-rank or act via simple scaling/addition, the shared factors may force collisions precisely on the nested cases the method targets; this requires either a theoretical argument or targeted ablation showing preserved distinctions.
  2. [Abstract and Experiments section] The abstract states that comprehensive experiments demonstrate the performance claim, yet provides no quantitative results, ablation details on tensor rank, or description of how absorption vectors or rank were selected. Without these, it is impossible to assess whether the reported gains are robust or post-hoc.
minor comments (1)
  1. [Method] Notation for the CP factors and absorption vectors should be introduced with explicit equations early in the method section to clarify how the shared factors interact with per-order terms.
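The crosstalk concern in the first major comment can be illustrated with a toy example. The sketch assumes an elementwise-product CP form in which a bigram's absent third position is replaced by a single absorption vector `w`; that form, and every name and size here, is an assumption rather than the paper's definition. It also leaves out the per-order scale and normalization from the Figure 4 caption, which may change the picture in the real module.

```python
import numpy as np

rng = np.random.default_rng(1)
V, R = 20, 6   # toy vocabulary size and CP rank (hypothetical)

# Shared factors for positions 1..3 and one absorption vector for the missing third slot.
A1, A2, A3 = (rng.standard_normal((V, R)) for _ in range(3))
w = rng.standard_normal(R)

def bigram_latent(x1, x2):
    """Order-2 latent: the absent third position is replaced by w (assumed form)."""
    return A1[x1] * A2[x2] * w

def trigram_latent(x1, x2, x3):
    """Order-3 latent: all three positions use their shared factors."""
    return A1[x1] * A2[x2] * A3[x3]

# Generic case: the bigram and its trigram extension get different latents.
generic_gap = np.linalg.norm(bigram_latent(2, 5) - trigram_latent(2, 5, 9))

# Degenerate case: if token 9's third-position factor coincides with w,
# the trigram (x1, x2, 9) is indistinguishable from the bigram (x1, x2).
A3[9] = w
collapsed_gap = np.linalg.norm(bigram_latent(2, 5) - trigram_latent(2, 5, 9))
print(generic_gap, collapsed_gap)
```

Under these assumptions the distinction between nested n-grams rests on no learned row of `A3` landing on or near `w`, which is the kind of property a targeted ablation could check.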

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond point-by-point to the major comments below, indicating planned revisions.

  1. Referee: [Method (CP decomposition and absorption mechanism)] The load-bearing assumption that the CP factorization (shared token-position factors + order-absorption vectors) can recover the order-specific distinctions among nested n-grams without rank collapse or crosstalk is not automatically guaranteed by the CP form. If absorption vectors are limited-rank or act via simple scaling/addition, the shared factors may force collisions precisely on the nested cases the method targets; this requires either a theoretical argument or targeted ablation showing preserved distinctions.

    Authors: We appreciate the referee's point on the representational properties of the CP form. TN-gram is motivated by the observation that order-absorption vectors can modulate shared factors to encode order-specific information without requiring separate tables. While the current manuscript does not include a formal proof against crosstalk, the empirical results across multiple tasks show that TN-gram matches or exceeds the performance of per-order hash tables, indicating that distinctions are preserved in practice. To directly address the concern, we will add a targeted ablation examining the effect of absorption vector rank on the model's ability to distinguish nested n-grams. revision: yes

  2. Referee: [Abstract and Experiments section] The abstract states that comprehensive experiments demonstrate the performance claim, yet provides no quantitative results, ablation details on tensor rank, or description of how absorption vectors or rank were selected. Without these, it is impossible to assess whether the reported gains are robust or post-hoc.

    Authors: We agree that the abstract would benefit from greater specificity. In the revised manuscript we will update the abstract to include key quantitative results (e.g., parameter reduction and perplexity or downstream metrics) and a brief statement on the tensor-rank selection procedure used in the experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: architectural proposal validated by independent experiments


The paper presents TN-gram as a new memory module based on CP decomposition with shared factors; its performance claims rest on empirical comparisons to Engram-style baselines rather than any derivation, fitted parameter, or self-citation that reduces the reported outcome to an input by construction. No load-bearing step equates the claimed benefit to a quantity defined inside the method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities beyond the standard assumption that a low-rank CP factorization can approximate the desired n-gram embedding table.


Reference graph

Works this paper leans on

54 extracted references · 6 canonical work pages

  [1] Faster language models with better multi-token prediction using tensor decomposition. arXiv preprint arXiv:2410.17765.
  [2] Liu, Hong and Zhang, Jiaqi and Wang, Chao and Hu, Xing and Lyu, Linkun and Sun, Jiaqi and Yang, Xurui and Wang, Bo and Li, Fengcun and Qian, Yulei and others.
  [3] Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling. Proceedings of the 42nd International Conference on Machine Learning. 2025.
  [4] Cheng, Xin and Zeng, Wangding and Dai, Damai and Chen, Qinyu and Wang, Bingxuan and Xie, Zhenda and Huang, Kezhao and Yu, Xingkai and Hao, Zhewen and Li, Yukun and others.
  [5] Andrej Karpathy. 2025.
  [6] OpenAI. 2026.
  [7] Guilherme Penedo and Hynek Kydl. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  [8] Manning, Christopher and Schutze, Hinrich.
  [9] Chen, Stanley F and Goodman, Joshua. 1999.
  [10] Pauls, Adam and Klein, Dan.
  [11] Cichocki, Andrzej and Mandic, Danilo and De Lathauwer, Lieven and Zhou, Guoxu and Zhao, Qibin and Caiafa, Cesar and Phan, Huy Anh. 2015.
  [12] Cichocki, Andrzej and Lee, Namgil and Oseledets, Ivan and Phan, Anh-Huy and Zhao, Qibin and Mandic, Danilo P. 2016.
  [13] Gu, Yuxuan and Zhou, Wuyang and Iacovides, Giorgos and Mandic, Danilo.
  [14] Gu, Yuxuan and Zhou, Wuyang and Iacovides, Giorgos and Mandic, Danilo. Te
  [15] Zhou, Wuyang and Gu, Yuxuan and Iacovides, Giorgos and Mandic, Danilo. Krom
  [16] Iacovides, Giorgos and Zhou, Wuyang and Li, Chao and Zhao, Qibin and Mandic, Danilo.
  [17] Zhou, Wuyang and Iacovides, Giorgos and Konstantinidis, Kriton and Kisil, Ilya and Mandic, Danilo.
  [18] Iacovides, Giorgos and Zhou, Wuyang and Mandic, Danilo.
  [19] Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser,. Advances in Neural Information Processing Systems.
  [20] Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina.
  [21] Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and others.
  [22] Goodman, Joshua T. 2001.
  [23] Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin.
  [24] Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind.
  [25] Gordon, Andrew and Kozareva, Zornitsa and Roemmele, Melissa. *SEM 2012: The First Joint Conference on Lexical and Computational Semantics -- Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012). 2012.
  [26] Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v1/N19-1421
  [27] Bisk, Yonatan and Zellers, Rowan and Gao, Jianfeng and Choi, Yejin and others.
  [28] Mihaylov, Todor and Clark, Peter and Khot, Tushar and Sabharwal, Ashish.
  [29] Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Ngoc Quan and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel. The LAMBADA dataset: Word prediction requiring a broad discourse context. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1144
  [30] Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy. SQuAD: 100,000+ Questions for Machine Comprehension of Text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. doi:10.18653/v1/D16-1264
  [31] Reddy, Siva and Chen, Danqi and Manning, Christopher D. Transactions of the Association for Computational Linguistics. 2019. doi:10.1162/tacl_a_00266
  [32] Levesque, Hector J and Davis, Ernest and Morgenstern, Leora.
  [33] Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v1/N19-1300
  [34] Zhong, Wanjun and Cui, Ruixiang and Guo, Yiduo and Liang, Yaobo and Lu, Shuai and Wang, Yanlin and Saied, Amin and Chen, Weizhu and Duan, Nan.
  [35] Srivastava, Aarohi and Rastogi, Abhinav and Rao, Abhishek and Shoeb, Abu Awal Md and Abid, Abubakar and Fisch, Adam and Brown, Adam R and Santoro, Adam and Gupta, Aditya and Garriga-Alonso, Adri. Transactions on Machine Learning Research.
  [36] Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin. 2021.
  [37] Li, Jeffrey and Fang, Alex and Smyrnis, Georgios and Ivgi, Maor and Jordan, Matt and Gadre, Samir and Bansal, Hritik and Guha, Etash and Keh, Sedrick and Arora, Kushal and others.
  [38] Keller Jordan and Yuchen Jin and Vlado Boza and Jiacheng You and Franz Cesista and Laker Newhouse and Jeremy Bernstein. 2024.
  [39] Kingma, Diederik P and Ba, Jimmy.
  [40] Ainslie, Joshua and Lee-Thorp, James and De Jong, Michiel and Zemlyanskiy, Yury and Lebr. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
  [41] Llama Team.
  [42] Kolda, Tamara G. and Bader, Brett W. SIAM Review. 2009.
  [43] Zhou, Yanqi and Lei, Tao and Liu, Hanxiao and Du, Nan and Huang, Yanping and Zhao, Vincent and Dai, Andrew M and Le, Quoc V and Laudon, James and others.
  [44] Noam Shazeer and Azalia Mirhoseini and Krzysztof Maziarz and Andy Davis and Quoc Le and Geoffrey Hinton and Jeff Dean. 2017.
  [45] Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas. 2017.
  [46] Brants, Thorsten and Popat, Ashok and Xu, Peng and Och, Franz Josef and Dean, Jeffrey.
  [47] Nguyen, Timothy.
  [48] Yu, Da and Cohen, Edith and Ghazi, Badih and Huang, Yangsibo and Kamath, Pritish and Kumar, Ravi and Liu, Daogao and Zhang, Chiyuan.
  [49] Pagnoni, Artidoro and Pasunuru, Ramakanth and Rodriguez, Pedro and Nguyen, John and Muller, Benjamin and Li, Margaret and Zhou, Chunting and Yu, Lili and Weston, Jason E and Zettlemoyer, Luke and others.
  [50] Tito Svenstrup, Dan and Hansen, Jonas and Winther, Ole.
  [51] Xu, Mingxue and Xu, Yao Lei and Mandic, Danilo P.
  [52] Novikov, Alexander and Podoprikhin, Dmitrii and Osokin, Anton and Vetrov, Dmitry P.
  [53] Frank Wilcoxon.
  [54] Yang, Yifan and Zhou, Jiajun and Wong, Ngai and Zhang, Zheng. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.174