Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory
Pith reviewed 2026-05-21 05:46 UTC · model grok-4.3
The pith
Memory Grafting reuses frozen hidden states from a grafting model as external n-gram memory to scale language model capacity with low overhead.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Memory Grafting builds conditional n-gram memory offline: a frozen grafting model is run on frequent local n-grams, and the final-token hidden state of each n-gram is stored as a reusable value. The recipient model retrieves these values by exact longest-match suffix lookup and adapts them through lightweight projections and gates; a hash-based Engram fallback covers contexts with no match.
What carries the argument
Offline conditional n-gram memory built from final-token hidden states of a grafting model and retrieved by exact longest-match suffix lookup.
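The retrieval step can be illustrated with a minimal sketch. It assumes the memory bank is a plain Python dict keyed by token-id tuples; the names `MemoryBank`, `longest_suffix_match`, and `max_n` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical memory bank: n-gram (tuple of token ids) -> stored hidden state.
MemoryBank = Dict[Tuple[int, ...], List[float]]


def longest_suffix_match(
    context: Sequence[int], bank: MemoryBank, max_n: int
) -> Optional[Tuple[Tuple[int, ...], List[float]]]:
    """Return the longest suffix of `context` (at most max_n tokens) present in `bank`."""
    # Try the longest candidate first, then back off to shorter suffixes.
    for n in range(min(max_n, len(context)), 0, -1):
        key = tuple(context[-n:])
        value = bank.get(key)  # one expected O(1) hash probe per candidate length
        if value is not None:
            return key, value
    return None  # no match: the caller would use its fallback path


# Toy bank with made-up vectors, for illustration only.
bank: MemoryBank = {
    (7, 8): [0.1, 0.2],
    (5, 7, 8): [0.3, 0.4],
    (9,): [0.5, 0.6],
}
match = longest_suffix_match([1, 5, 7, 8], bank, max_n=4)
miss = longest_suffix_match([1, 2, 3], bank, max_n=4)
```

In this sketch each query makes at most `max_n` dict probes regardless of how many entries the bank holds, which is one way to read the expected O(1) cost with respect to memory-bank size stated in the abstract.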
If this is right
- External latent capacity expands with limited training and inference overhead relative to learning memory tables from scratch.
- Average benchmark scores rise from 51.95 for MoE and 52.43 for vanilla Engram to 53.86 in the 2.8B-scale setting.
- All grafting-model variants outperform baselines in the 0.92B-scale experiments, with larger grafting models yielding stronger gains.
- Pretrained models can serve as reusable constructors of external latent memory for future scaling beyond trainable parameters.
Where Pith is reading between the lines
- Combining memory banks from several grafting models could let a single recipient cover multiple specialized domains without extra training.
- The method might let a small recipient model approach the performance of a much larger model by grafting memory from it.
- Replacing exact suffix lookup with approximate or learned retrieval could raise coverage for rare or long contexts.
Load-bearing premise
Final-token hidden states produced by the grafting model on frequent local n-grams remain useful and transferable when retrieved via exact longest-match suffix lookup and adapted only by lightweight projections and gates inside the recipient model.
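To make the premise concrete, the sketch below shows one possible form of "lightweight projections and gates". The source does not specify the adapter, so the linear projection, the scalar sigmoid gate computed from the recipient state, and the residual add are all assumptions, as are the names `adapt_memory`, `w_proj`, and `w_gate`. The fallback vector is assumed to share the memory dimension.

```python
import math
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def adapt_memory(
    hidden: Vector,
    retrieved: Optional[Vector],
    fallback: Vector,
    w_proj: Matrix,
    w_gate: Vector,
) -> Vector:
    """Mix a retrieved memory vector into the recipient hidden state (assumed form)."""
    # Use the grafted value on a hit, otherwise the fallback value.
    source = retrieved if retrieved is not None else fallback
    # Project from the memory width to the recipient width.
    projected = matvec(w_proj, source)
    # Scalar gate in (0, 1) computed from the recipient state.
    gate = 1.0 / (1.0 + math.exp(-sum(g * h for g, h in zip(w_gate, hidden))))
    # Residual update: the recipient state plus the gated memory contribution.
    return [h + gate * p for h, p in zip(hidden, projected)]


# Toy values, for illustration only.
hidden = [1.0, 0.0]
w_proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_gate = [0.0, 0.0]  # zero weights give a gate of exactly 0.5
hit = adapt_memory(hidden, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0], w_proj, w_gate)
miss = adapt_memory(hidden, None, [0.0, 0.0, 0.0], w_proj, w_gate)
```

Only `w_proj` and `w_gate` would be trained in such a design; the retrieved vectors stay frozen, which is what makes the transferability of the grafted states the load-bearing question.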
What would settle it
An ablation experiment on the same training data and recipient architecture where grafted memory retrieval is replaced by random vectors or disabled entirely, then checking whether benchmark gains over vanilla Engram disappear.
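A random-vector control of this kind could be set up as sketched below. The sketch keeps every key of the memory bank and swaps each stored value for a Gaussian direction rescaled to the original L2 norm; the norm matching is an assumption made here so that the control differs from the grafted bank only in content, and `randomize_bank` is a hypothetical helper name.

```python
import math
import random
from typing import Dict, List, Tuple

MemoryBank = Dict[Tuple[int, ...], List[float]]


def randomize_bank(bank: MemoryBank, seed: int = 0) -> MemoryBank:
    """Keep the keys; replace each value with a random vector of matching norm."""
    rng = random.Random(seed)  # seeded so the control bank is reproducible
    control: MemoryBank = {}
    for key, value in bank.items():
        norm = math.sqrt(sum(v * v for v in value))
        noise = [rng.gauss(0.0, 1.0) for _ in value]
        noise_norm = math.sqrt(sum(n * n for n in noise)) or 1.0
        # Rescale the random direction to the norm of the original value.
        control[key] = [n * norm / noise_norm for n in noise]
    return control


# Toy bank with made-up vectors, for illustration only.
bank: MemoryBank = {(5, 7, 8): [3.0, 4.0], (9,): [0.0, 1.0]}
control = randomize_bank(bank, seed=0)
```

Training the recipient once with `bank` and once with `control`, with all other components fixed, would separate the contribution of the grafted content from that of the added lookup path and adapter parameters.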
Original abstract
Scaling conditional memory offers a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, making memory scaling expensive and sometimes ineffective. We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, we run the grafting model offline, store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts. Since the grafting model is only run offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands external latent capacity with limited training and inference overhead. Experiments under matched recipient architectures and pre-training budgets show that Memory Grafting improves over both MoE and vanilla Engram baselines. In the 2.8B-scale setting, it improves the average benchmark score from 51.95 for MoE and 52.43 for vanilla Engram to 53.86. In the 0.92B-scale setting, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains. These results suggest that pretrained models can serve as reusable constructors of external latent memory, providing a practical step toward scaling future language models beyond trainable parameters alone.
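The offline construction described in the abstract can be sketched as follows. The grafting model is replaced by a stand-in function (`toy_encode`), and the frequency threshold `min_count`, the n-gram range starting at 1, and all function names are assumptions for illustration rather than details from the paper.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

NGram = Tuple[int, ...]


def count_ngrams(corpus: Iterable[Sequence[int]], max_n: int) -> Counter:
    """Count every n-gram of length 1..max_n in a tokenized corpus."""
    counts: Counter = Counter()
    for tokens in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def build_memory_bank(
    corpus: Iterable[Sequence[int]],
    encode: Callable[[NGram], List[float]],
    max_n: int,
    min_count: int,
) -> Dict[NGram, List[float]]:
    """Run `encode` once per frequent n-gram and store the result as its value."""
    counts = count_ngrams(corpus, max_n)
    return {ng: encode(ng) for ng, c in counts.items() if c >= min_count}


def toy_encode(ngram: NGram) -> List[float]:
    """Stand-in for the frozen grafting model's final-token hidden state."""
    return [float(len(ngram)), float(ngram[-1])]


# Toy corpus of token ids, for illustration only.
corpus = [[1, 2, 3], [1, 2, 4]]
bank = build_memory_bank(corpus, toy_encode, max_n=2, min_count=2)
```

Because `encode` is called only while the bank is built, the cost of the grafting model is paid once offline and does not appear in the recipient's training or inference loop.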
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Memory Grafting as a conditional memory scaling technique for language model pre-training. It computes final-token hidden states offline from a frozen larger grafting model on frequent n-grams, stores them as memory values, and enables retrieval in a smaller recipient model via exact longest-match suffix lookup. Retrieved states are adapted using lightweight projections and gates, with a hash-based Engram fallback for unmatched contexts. Experiments under matched architectures and budgets report benchmark gains over MoE and vanilla Engram baselines: at 2.8B scale the average score rises from 51.95 (MoE) and 52.43 (vanilla Engram) to 53.86, and at 0.92B scale every grafting-model variant improves over the baselines, with stronger grafting models giving larger gains.
Significance. If the gains are attributable to the specific transferable representations from the grafting model rather than added parameters or coverage alone, the approach provides an efficient route to external latent memory that reuses pretrained models as constructors, reducing the cost of learning large memory tables from scratch and supporting capacity scaling beyond trainable parameters.
major comments (2)
- [§4] §4 (Experiments): The reported benchmark improvements (e.g., 51.95 for MoE and 52.43 for vanilla Engram to 53.86) are presented without error bars, multiple random seeds, or statistical tests, leaving open the possibility that observed differences fall within training variance and weakening the empirical support for the central scaling claim.
- [§3 and §4.2] §3 (Method) and §4.2 (Ablations): The key assumption that grafting-model final-token hidden states remain useful under exact suffix lookup and lightweight adaptation is load-bearing, yet no controls (e.g., random vectors, recipient self-states, or alignment metrics) or ablations isolating grafted content from the added projection/gate parameters and fallback mechanism are described; without these, gains could arise from capacity rather than the pretrained states, undermining the 'reusable constructors of external latent memory' argument.
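The error bars and statistical test requested in the first major comment could be reported with a summary like the one sketched here. It computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom from per-seed scores using only the standard library. The score lists are synthetic placeholders, not results from the paper, and `welch_t` is a hypothetical helper name.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, degrees of freedom) for two samples with unequal variance."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# PLACEHOLDER per-seed scores, chosen only to exercise the function.
seeds_a = [1.0, 2.0, 3.0]
seeds_b = [2.0, 3.0, 4.0]
t_stat, dof = welch_t(seeds_a, seeds_b)
```

Converting the statistic to a p-value needs a Student t distribution, which the Python standard library does not provide, so that step is left out of the sketch.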
minor comments (2)
- [§3] The description of expected O(1) lookup complexity with respect to memory-bank size would benefit from an explicit statement of the hash-table implementation and worst-case behavior.
- [§2] Notation for the grafting model variants (e.g., Qwen3.5-35B-A3B) and recipient scales should be introduced with a table for clarity when first mentioned.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate revisions to the manuscript where appropriate.
Point-by-point responses
- Referee: [§4] §4 (Experiments): The reported benchmark improvements (e.g., 51.95 for MoE and 52.43 for vanilla Engram to 53.86) are presented without error bars, multiple random seeds, or statistical tests, leaving open the possibility that observed differences fall within training variance and weakening the empirical support for the central scaling claim.
  Authors: We agree that reporting results from single training runs without error bars or statistical tests is a limitation. Pre-training at the 2.8B scale under matched budgets is computationally expensive, which limited us to one run per configuration. That said, the gains appear consistently across two model scales (0.92B and 2.8B) and across grafting models of different strengths, with larger improvements from stronger grafting models. In the revised manuscript we will add a paragraph in §4 explicitly noting the single-run limitation and highlighting the cross-scale and cross-grafting-model consistency as supporting evidence for the scaling claim.
  Revision: yes
- Referee: [§3 and §4.2] §3 (Method) and §4.2 (Ablations): The key assumption that grafting-model final-token hidden states remain useful under exact suffix lookup and lightweight adaptation is load-bearing, yet no controls (e.g., random vectors, recipient self-states, or alignment metrics) or ablations isolating grafted content from the added projection/gate parameters and fallback mechanism are described; without these, gains could arise from capacity rather than the pretrained states, undermining the 'reusable constructors of external latent memory' argument.
  Authors: We partially addressed the concern by comparing against the vanilla Engram baseline, which uses identical projection/gate parameters and the same hash-based fallback but learns memory values from scratch rather than grafting pretrained states. We also report that stronger grafting models (e.g., Qwen3.5-35B-A3B) produce larger gains than weaker ones under fixed recipient architecture and budget, suggesting the benefit is not solely from added capacity. To more directly isolate the grafted content, we will add an ablation that replaces grafted hidden states with random vectors while keeping all other components fixed; this will appear in the revised §4.2.
  Revision: partial
Circularity Check
No circularity: empirical benchmark gains over matched baselines
Full rationale
The paper describes an engineering method (offline grafting of final-token hidden states from a larger model, exact suffix lookup, lightweight adaptation, and Engram fallback) and validates it via controlled pre-training experiments that report average benchmark improvements (51.95/52.43 → 53.86 at 2.8B scale). No derivation, equation, or first-principles claim is presented that reduces by construction to fitted parameters, self-citations, or renamed inputs; results rest on external comparisons to MoE and vanilla Engram under matched architectures and budgets, rendering the work self-contained against observable performance metrics.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory... Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts.
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.