pith. sign in

arxiv: 2605.20948 · v1 · pith:V4PU5NGAnew · submitted 2026-05-20 · 💻 cs.CL

Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory

Pith reviewed 2026-05-21 05:46 UTC · model grok-4.3

classification 💻 cs.CL
keywords Memory Graftingconditional memoryn-gram memorylanguage model scalingexternal memoryMoEEngramoffline pre-training
0
0 comments X

The pith

Memory Grafting reuses frozen hidden states from a grafting model as external n-gram memory to scale language model capacity with low overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Memory Grafting to make conditional memory scaling in language models more practical than learning large tables from scratch during pre-training. It runs a separate grafting model offline on frequent local n-grams, stores their final-token hidden representations as memory values, and lets the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates inside the recipient model, with a hash-based Engram fallback for unmatched contexts. Experiments under matched architectures and budgets show gains over both MoE and vanilla Engram baselines, reaching an average benchmark score of 53.86 at 2.8B scale. This positions pretrained models as reusable builders of external latent memory, expanding capacity beyond trainable parameters alone.

Core claim

Memory Grafting constructs conditional n-gram memory by running a frozen grafting model offline on frequent local n-grams, storing final-token hidden states as reusable values, and retrieving them in the recipient model via exact longest-match suffix lookup followed by adaptation through lightweight projections and gates plus a hash-based Engram fallback.

What carries the argument

Offline conditional n-gram memory built from final-token hidden states of a grafting model and retrieved by exact longest-match suffix lookup.

If this is right

  • External latent capacity expands with limited training and inference overhead relative to learning memory tables from scratch.
  • Average benchmark scores rise from 51.95 for MoE and 52.43 for vanilla Engram to 53.86 in the 2.8B-scale setting.
  • All grafting-model variants outperform baselines in the 0.92B-scale experiments, with larger grafting models yielding stronger gains.
  • Pretrained models can serve as reusable constructors of external latent memory for future scaling beyond trainable parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Combining memory banks from several grafting models could let a single recipient cover multiple specialized domains without extra training.
  • The method might let a small recipient model approach the performance of a much larger model by grafting memory from it.
  • Replacing exact suffix lookup with approximate or learned retrieval could raise coverage for rare or long contexts.

Load-bearing premise

Final-token hidden states produced by the grafting model on frequent local n-grams remain useful and transferable when retrieved via exact longest-match suffix lookup and adapted only by lightweight projections and gates inside the recipient model.

What would settle it

An ablation experiment on the same training data and recipient architecture where grafted memory retrieval is replaced by random vectors or disabled entirely, then checking whether benchmark gains over vanilla Engram disappear.

read the original abstract

Scaling conditional memory offers a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, making memory scaling expensive and sometimes ineffective. We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, we run the grafting model offline, store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts. Since the grafting model is only run offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands external latent capacity with limited training and inference overhead. Experiments under matched recipient architectures and pre-training budgets show that Memory Grafting improves over both MoE and vanilla Engram baselines. In the 2.8B-scale setting, it improves the average benchmark score from 51.95 for MoE and 52.43 for vanilla Engram to 53.86. In the 0.92B-scale setting, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains. These results suggest that pretrained models can serve as reusable constructors of external latent memory, providing a practical step toward scaling future language models beyond trainable parameters alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Memory Grafting as a conditional memory scaling technique for language model pre-training. It computes final-token hidden states offline from a frozen larger grafting model on frequent n-grams, stores them as memory values, and enables retrieval in a smaller recipient model via exact longest-match suffix lookup. Retrieved states are adapted using lightweight projections and gates, with a hash-based Engram fallback for unmatched contexts. Experiments under matched architectures and budgets report benchmark gains over MoE and vanilla Engram baselines, e.g., average score rising from 51.95/52.43 to 53.86 at 2.8B scale and consistent improvements at 0.92B scale with stronger grafting models.

Significance. If the gains are attributable to the specific transferable representations from the grafting model rather than added parameters or coverage alone, the approach provides an efficient route to external latent memory that reuses pretrained models as constructors, reducing the cost of learning large memory tables from scratch and supporting capacity scaling beyond trainable parameters.

major comments (2)
  1. [§4] §4 (Experiments): The reported benchmark improvements (e.g., 51.95 for MoE and 52.43 for vanilla Engram to 53.86) are presented without error bars, multiple random seeds, or statistical tests, leaving open the possibility that observed differences fall within training variance and weakening the empirical support for the central scaling claim.
  2. [§3 and §4.2] §3 (Method) and §4.2 (Ablations): The key assumption that grafting-model final-token hidden states remain useful under exact suffix lookup and lightweight adaptation is load-bearing, yet no controls (e.g., random vectors, recipient self-states, or alignment metrics) or ablations isolating grafted content from the added projection/gate parameters and fallback mechanism are described; without these, gains could arise from capacity rather than the pretrained states, undermining the 'reusable constructors of external latent memory' argument.
minor comments (2)
  1. [§3] The description of expected O(1) lookup complexity with respect to memory-bank size would benefit from an explicit statement of the hash-table implementation and worst-case behavior.
  2. [§2] Notation for the grafting model variants (e.g., Qwen3.5-35B-A3B) and recipient scales should be introduced with a table for clarity when first mentioned.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate revisions to the manuscript where appropriate.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): The reported benchmark improvements (e.g., 51.95 for MoE and 52.43 for vanilla Engram to 53.86) are presented without error bars, multiple random seeds, or statistical tests, leaving open the possibility that observed differences fall within training variance and weakening the empirical support for the central scaling claim.

    Authors: We agree that reporting results from single training runs without error bars or statistical tests is a limitation. Pre-training at the 2.8B scale under matched budgets is computationally expensive, which limited us to one run per configuration. That said, the gains appear consistently across two model scales (0.92B and 2.8B) and across grafting models of different strengths, with larger improvements from stronger grafting models. In the revised manuscript we will add a paragraph in §4 explicitly noting the single-run limitation and highlighting the cross-scale and cross-grafting-model consistency as supporting evidence for the scaling claim. revision: yes

  2. Referee: [§3 and §4.2] §3 (Method) and §4.2 (Ablations): The key assumption that grafting-model final-token hidden states remain useful under exact suffix lookup and lightweight adaptation is load-bearing, yet no controls (e.g., random vectors, recipient self-states, or alignment metrics) or ablations isolating grafted content from the added projection/gate parameters and fallback mechanism are described; without these, gains could arise from capacity rather than the pretrained states, undermining the 'reusable constructors of external latent memory' argument.

    Authors: We partially addressed the concern by comparing against the vanilla Engram baseline, which uses identical projection/gate parameters and the same hash-based fallback but learns memory values from scratch rather than grafting pretrained states. We also report that stronger grafting models (e.g., Qwen3.5-35B-A3B) produce larger gains than weaker ones under fixed recipient architecture and budget, suggesting the benefit is not solely from added capacity. To more directly isolate the grafted content, we will add an ablation that replaces grafted hidden states with random vectors while keeping all other components fixed; this will appear in the revised §4.2. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark gains over matched baselines

full rationale

The paper describes an engineering method (offline grafting of final-token hidden states from a larger model, exact suffix lookup, lightweight adaptation, and Engram fallback) and validates it via controlled pre-training experiments that report average benchmark improvements (51.95/52.43 → 53.86 at 2.8B scale). No derivation, equation, or first-principles claim is presented that reduces by construction to fitted parameters, self-citations, or renamed inputs; results rest on external comparisons to MoE and vanilla Engram under matched architectures and budgets, rendering the work self-contained against observable performance metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method implicitly relies on the transferability of hidden states and the effectiveness of lightweight adaptation, but these are not formalized here.

pith-pipeline@v0.9.0 · 5823 in / 1328 out tokens · 27798 ms · 2026-05-21T05:46:05.346744+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 15 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conferenceon artificialintelligence, volume 34, pages 7432–7439, 2020

  3. [3]

    Exploring influencers’ and users’ experiences in douyin’s virtual reality live-streaming

    Rongyi Chen, Jingjia Xiao, Zilu Wang, Menghan Yin, Xianzhe Fan, Zihe Ran, and Qing Xiao. Exploring influencers’ and users’ experiences in douyin’s virtual reality live-streaming. InProceedings of the 30th ACM Symposium on Virtual Reality Software and Technology, pages 1–2, 2024

  4. [4]

    Whoever started the interference should end it: Guiding data-free model merging via task vectors.arXiv preprint arXiv:2503.08099, 2025

    Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, and Chun Yuan. Whoever started the interference should end it: Guiding data-free model merging via task vectors.arXiv preprint arXiv:2503.08099, 2025

  5. [5]

    Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

    Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, et al. Conditional memory via scalable lookup: A new axis of sparsity for large language models. arXiv preprint arXiv:2601.07372, 2026

  6. [6]

    Boolq: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), ...

  7. [7]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXivpreprint arXiv:1803.05457, 2018

  8. [8]

    Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models

    Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1280–1297, 2024

  9. [9]

    Reliability of cka as a similarity measure in deep learning.arXiv preprint arXiv:2210.16156, 2022

    MohammadRezaDavari,StefanHoroi,AmineNatik,GuillaumeLajoie,GuyWolf,andEugeneBelilovsky. Reliability of cka as a similarity measure in deep learning.arXiv preprint arXiv:2210.16156, 2022

  10. [10]

    Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024

    DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024

  11. [11]

    Sigmoid-weighted linear units for neural network function approxi- mation in reinforcement learning.Neural networks, 107:3–11, 2018

    Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approxi- mation in reinforcement learning.Neural networks, 107:3–11, 2018

  12. [12]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

  13. [13]

    The language model evaluation harness, 07 2024

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  14. [14]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, Haofen Wang, et al. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2(1):32, 2023

  15. [15]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team Glm, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024

  16. [16]

    Distilled pretraining: A modern lens of data, in-context learning and test-time scaling.arXiv preprint arXiv:2509.01649, 2025

    Sachin Goyal, David Lopez-Paz, and Kartik Ahuja. Distilled pretraining: A modern lens of data, in-context learning and test-time scaling.arXiv preprint arXiv:2509.01649, 2025

  17. [17]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  18. [18]

    Enhancing logits distillation with plug&play kendall’sτ ranking loss, 2025

    Yuchen Guan, Runxi Cheng, Kang Liu, and Chun Yuan. Enhancing logits distillation with plug&play kendall’sτ ranking loss, 2025

  19. [19]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. InInternational conference on machine learning, pages 3929–3938. PMLR, 2020

  20. [20]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. 11 arXiv preprint arXiv:2203.15556, 10, 2022

  21. [21]

    Ultra-sparse memory network

    Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, and Xun Zhou. Ultra-sparse memory network. arXiv preprint arXiv:2411.12364, 2024

  22. [22]

    Open-rag: Enhanced retrieval augmented reasoning with open-source large language models

    Shayekh Bin Islam, Md Asib Rahman, KSM Tozammel Hossain, Enamul Hoque, Shafiq Joty, and Md Rizwan Parvez. Open-rag: Enhanced retrieval augmented reasoning with open-source large language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 14231–14244, 2024

  23. [23]

    Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models.arXiv preprint arXiv:2310.09949, 2023

    Wenqi Jiang, Marco Zeller, Roger Waleffe, Torsten Hoefler, and Gustavo Alonso. Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models.arXiv preprint arXiv:2310.09949, 2023

  24. [24]

    Mixture of lookup experts

    Shibo Jie, Yehui Tang, Kai Han, Yitong Li, Duyu Tang, Zhi-Hong Deng, and Yunhe Wang. Mixture of lookup experts. arXiv preprint arXiv:2503.15798, 2025

  25. [25]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Rad- ford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  26. [26]

    Improved backing-off for m-gram language modeling

    Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In1995 international conference on acoustics, speech, and signal processing, volume 1, pages 181–184. IEEE, 1995

  27. [27]

    Race: Large-scale reading comprehension dataset from examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations. InProceedings of the 2017 conference on empirical methods in natural language processing, pages 785–794, 2017

  28. [28]

    Large memory layers with product keys.Advances in Neural InformationProcessing Systems, 32, 2019

    Guillaume Lample, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Large memory layers with product keys.Advances in Neural InformationProcessing Systems, 32, 2019

  29. [29]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural informationprocessing systems, 33:9459–9474, 2020

  30. [30]

    Rella: Retrieval-enhanced large language models for lifelong sequential behavior comprehension in recommendation

    Jianghao Lin, Rong Shan, Chenxu Zhu, Kounianhua Du, Bo Chen, Shigang Quan, Ruiming Tang, Yong Yu, and Weinan Zhang. Rella: Retrieval-enhanced large language models for lifelong sequential behavior comprehension in recommendation. InProceedings of the ACM WebConference 2024, pages 3497–3508, 2024

  31. [31]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  32. [32]

    Infini-gram: Scalingunbounded n-gram language models to a trillion tokens.arXiv preprint arXiv:2401.17377, 2024

    JiachengLiu, SewonMin, LukeZettlemoyer, YejinChoi, andHannanehHajishirzi. Infini-gram: Scalingunbounded n-gram language models to a trillion tokens.arXiv preprint arXiv:2401.17377, 2024

  33. [33]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  34. [34]

    The lambada dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc-Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. InProceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 1525–...

  35. [35]

    Pre-training distillation for large language models: A design space exploration

    Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, and Juanzi Li. Pre-training distillation for large language models: A design space exploration. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers), pages 3603–3618, 2025

  36. [36]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026

  37. [37]

    How users who are blind or low vision play mobile games: Perceptions, challenges, and strategies

    Zihe Ran, Xiyu Li, Qing Xiao, Xianzhe Fan, Franklin Mingzhe Li, Yanyun Wang, and Zhicong Lu. How users who are blind or low vision play mobile games: Perceptions, challenges, and strategies. InProceedings of the 2025 CHI Conference on Human Factorsin Computing Systems, pages 1–18, 2025

  38. [38]

    Understanding how visually impairedplayerssocializeinmobilegames

    Zihe Ran, Xiyu Li, Qing Xiao, Yanyun Wang, Franklin Mingzhe Li, and Zhicong Lu. Understanding how visually impairedplayerssocializeinmobilegames. In Proceedingsofthe27thInternationalACMSIGACCESSConference on Computers and Accessibility, pages 1–16, 2025

  39. [39]

    Stem: Scaling transformers with embedding modules

    Ranajoy Sadhukhan, Sheng Cao, Harry Dong, Changsheng Zhao, Attiano Purpura-Pontoniere, Yuandong Tian, Zechun Liu, and Beidi Chen. Stem: Scaling transformers with embedding modules. arXiv preprint arXiv:2601.10639, 2026

  40. [40]

    Winogrande: An adversarial winograd schema challenge at scale.Communicationsof the ACM, 64(9):99–106, 2021

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communicationsof the ACM, 64(9):99–106, 2021

  41. [41]

    A mathematicaltheory of communication.TheBellsystemtechnicaljournal, 27(3):379– 423, 1948

    Claude Elwood Shannon. A mathematicaltheory of communication.TheBellsystemtechnicaljournal, 27(3):379– 423, 1948

  42. [42]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Out- 12 rageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

  43. [43]

    Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset

    Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset. InProceedings ofthe 63rd AnnualMeeting ofthe Association for Computational Linguistics (Volume1: Long Papers), pages 2459–2475, 2025

  44. [44]

    Gemma 3n

    Gemma Team. Gemma 3n. 2025

  45. [45]

    Gemma Team. Gemma 4. 2026

  46. [46]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  47. [47]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

  48. [48]

    Attention is all you need.Advances in neural informationprocessing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural informationprocessing systems, 30, 2017

  49. [49]

    Optmerge: Unifying multimodal llm capabilities and modalities via model merging.arXiv preprint arXiv:2505.19892, 2025

    Yongxian Wei, Runxi Cheng, Weike Jin, Enneng Yang, Li Shen, Lu Hou, Sinan Du, Chun Yuan, Xiaochun Cao, and Dacheng Tao. Unifying multimodal large language model capabilities and modalities via model merging. arXiv preprint arXiv:2505.19892, 2025

  50. [50]

    Modeling multi-task model merging as adaptive projective gradient descent.arXiv preprint arXiv:2501.01230, 2025

    Yongxian Wei, Anke Tang, Li Shen, Zixuan Hu, Chun Yuan, and Xiaochun Cao. Modeling multi-task model merging as adaptive projective gradient descent.arXiv preprint arXiv:2501.01230, 2025

  51. [51]

    Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis

    Yongxian Wei, Yilin Zhao, Li Shen, Xinrui Chen, Runxi Cheng, Sinan Du, Hao Yu, Xiaohan Wang, Gang Liu, Jiahong Yan, et al. Learning to pose problems: Reasoning-driven and solver-adaptive data synthesis for large reasoning models.arXiv preprint arXiv:2511.09907, 2025

  52. [52]

    Multi-task model merging via adaptive weight disentanglement.arXiv preprint arXiv:2411.18729, 2024

    Feng Xiong, Runxi Cheng, Wang Chen, Zhanqiu Zhang, Yiwen Guo, Chun Yuan, and Ruifeng Xu. Multi-task model merging via adaptive weight disentanglement.arXiv preprint arXiv:2411.18729, 2024

  53. [53]

    Hs-star: Hierarchical sam- pling for self-taught reasoners via difficulty estimation and budget reallocation.arXiv preprint arXiv:2505.19866, 2025

    Feng Xiong, Hongling Xu, Yifei Wang, Runxi Cheng, Yong Wang, and Xiangxiang Chu. Hs-star: Hierarchical sam- pling for self-taught reasoners via difficulty estimation and budget reallocation.arXiv preprint arXiv:2505.19866, 2025

  54. [54]

    Hellaswag: Can a machine really finish your sentence? InProceedings ofthe 57th annualmeeting ofthe associationforcomputationallinguistics, pages 4791–4800, 2019

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? InProceedings ofthe 57th annualmeeting ofthe associationforcomputationallinguistics, pages 4791–4800, 2019

  55. [55]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471, 2025

  56. [56]

    Root mean square layer normalization.Advancesinneuralinformationprocessing systems, 32, 2019

    Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advancesinneuralinformationprocessing systems, 32, 2019

  57. [57]

    Accelerating retrieval-augmented language model serving with speculation.arXiv preprint arXiv:2401.14021, 2024

    Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, and Zhihao Jia. Accelerating retrieval-augmented language model serving with speculation.arXiv preprint arXiv:2401.14021, 2024. 13 Appendix Contents A Experiment Details 15 A.1 Model Architecture and Hyper Parameters . . . . . . . . . . . . . . . . . . . . . . . . ....