Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory

Chun Yuan; Feng Xiong; Qianpu Sun; Qixiu Li; Runxi Cheng; Sinan Du; Yan Lu; Yeyun Gong; Yongxian Wei; Yuchen Guan

arxiv: 2605.20948 · v1 · pith:V4PU5NGA · new · submitted 2026-05-20 · 💻 cs.CL

Runxi Cheng, Yuchen Guan, Yongxian Wei, Qianpu Sun, Qixiu Li, Sinan Du, Feng Xiong, Chun Yuan, Yan Lu, Yeyun Gong

Pith reviewed 2026-05-21 05:46 UTC · model grok-4.3

classification 💻 cs.CL

keywords: Memory Grafting · conditional memory · n-gram memory · language model scaling · external memory · MoE · Engram · offline pre-training

The pith

Memory Grafting reuses frozen hidden states from a grafting model as external n-gram memory to scale language model capacity with low overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Memory Grafting to make conditional memory scaling in language models more practical than learning large tables from scratch during pre-training. It runs a separate grafting model offline on frequent local n-grams, stores their final-token hidden representations as memory values, and lets the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates inside the recipient model, with a hash-based Engram fallback for unmatched contexts. Experiments under matched architectures and budgets show gains over both MoE and vanilla Engram baselines, reaching an average benchmark score of 53.86 at 2.8B scale. This positions pretrained models as reusable builders of external latent memory, expanding capacity beyond trainable parameters alone.

Core claim

Memory Grafting constructs conditional n-gram memory by running a frozen grafting model offline on frequent local n-grams, storing final-token hidden states as reusable values, and retrieving them in the recipient model via exact longest-match suffix lookup followed by adaptation through lightweight projections and gates plus a hash-based Engram fallback.
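The offline construction step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `count_ngrams`, `toy_grafting_model`, `build_memory_bank`, `max_n`, and `min_count` are hypothetical names, the toy encoder stands in for a real frozen pretrained model, and starting at n = 2 is an assumption of the sketch.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

NGram = Tuple[int, ...]
Vector = List[float]


def count_ngrams(corpus: Sequence[Sequence[int]], max_n: int) -> Counter:
    """Count every contiguous n-gram of length 2..max_n in a tokenized corpus."""
    counts: Counter = Counter()
    for tokens in corpus:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def toy_grafting_model(ngram: NGram, dim: int = 4) -> List[Vector]:
    """Hypothetical stand-in for a frozen model: one hidden state per token.

    Each state depends on the prefix seen so far, mimicking causal encoding.
    """
    states, running = [], 0
    for tok in ngram:
        running = (running * 31 + tok) % 997
        states.append([((running + k) % 10) / 10.0 for k in range(dim)])
    return states


def build_memory_bank(
    corpus: Sequence[Sequence[int]],
    encode: Callable[[NGram], List[Vector]],
    max_n: int = 3,
    min_count: int = 2,
) -> Dict[NGram, Vector]:
    """Keep frequent n-grams and store the final-token hidden state of each."""
    counts = count_ngrams(corpus, max_n)
    return {
        ngram: encode(ngram)[-1]      # final-token state becomes the memory value
        for ngram, c in counts.items()
        if c >= min_count             # frequency threshold (assumed form)
    }


corpus = [[1, 2, 3, 4], [1, 2, 3, 5], [7, 1, 2]]
bank = build_memory_bank(corpus, toy_grafting_model)
print(sorted(bank))  # the n-grams that passed the frequency threshold
```

Because the encoder runs only while the bank is being built, its cost is paid once offline, which is the property the summary above attributes to the method.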

What carries the argument

Offline conditional n-gram memory built from final-token hidden states of a grafting model and retrieved by exact longest-match suffix lookup.
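A small sketch may make the retrieval path concrete. It assumes the memory bank is a Python dict keyed by token-ID tuples and that the fallback hashes the last two tokens into a fixed table. The function names, the fallback order `n=2`, and the table layout are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Sequence, Tuple

NGram = Tuple[int, ...]
Vector = List[float]


def longest_suffix_match(
    context: Sequence[int], bank: Dict[NGram, Vector], max_n: int
) -> Optional[NGram]:
    """Return the longest suffix of `context` (up to max_n tokens) stored in `bank`."""
    for n in range(min(max_n, len(context)), 0, -1):
        suffix = tuple(context[-n:])
        if suffix in bank:  # dict probe: expected O(1) in the size of the bank
            return suffix
    return None


def hashed_fallback(context: Sequence[int], table: List[Vector], n: int = 2) -> Vector:
    """Hypothetical Engram-style fallback: hash the last n tokens into a fixed table."""
    key = tuple(context[-n:])
    return table[hash(key) % len(table)]


def retrieve(
    context: Sequence[int],
    bank: Dict[NGram, Vector],
    table: List[Vector],
    max_n: int = 3,
) -> Tuple[Vector, str]:
    """Use the grafted value on an exact match, otherwise the hashed slot."""
    match = longest_suffix_match(context, bank, max_n)
    if match is not None:
        return bank[match], "grafted"
    return hashed_fallback(context, table), "fallback"


bank = {(2, 3): [0.1, 0.2], (1, 2, 3): [0.9, 0.8]}
table = [[0.0, 0.0], [1.0, 1.0]]

print(retrieve([5, 1, 2, 3], bank, table))  # longest match is (1, 2, 3)
print(retrieve([9, 2, 3], bank, table))     # backs off to (2, 3)
print(retrieve([4, 4], bank, table))        # no match, hashed fallback
```

In this sketch each position makes at most `max_n` dictionary probes, so the per-token cost depends on the maximum n-gram order rather than on how many entries the bank holds.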

If this is right

External latent capacity expands with limited training and inference overhead relative to learning memory tables from scratch.
Average benchmark scores rise from 51.95 for MoE and 52.43 for vanilla Engram to 53.86 in the 2.8B-scale setting.
All grafting-model variants outperform baselines in the 0.92B-scale experiments, with larger grafting models yielding stronger gains.
Pretrained models can serve as reusable constructors of external latent memory for future scaling beyond trainable parameters.
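For scale, the 2.8B-setting gaps can be worked out directly from the three reported averages. The helper name `point_gains` is hypothetical; the scores are the ones quoted above, and the snippet adds no variance estimate because the source reports none.

```python
# Average benchmark scores at the 2.8B scale, as quoted in the summary above.
scores = {"MoE": 51.95, "Engram": 52.43, "Memory Grafting": 53.86}


def point_gains(scores: dict, method: str) -> dict:
    """Absolute gain of `method` over every other entry, in benchmark points."""
    return {
        name: round(scores[method] - value, 2)
        for name, value in scores.items()
        if name != method
    }


gains = point_gains(scores, "Memory Grafting")
for baseline, gain in gains.items():
    relative = 100 * gain / scores[baseline]
    print(f"vs {baseline}: +{gain:.2f} points ({relative:.1f}% relative)")
```

The gap over vanilla Engram (1.43 points) is smaller than the gap over MoE (1.91 points), which matters because Engram is the closer comparison for isolating what the grafted values contribute.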

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Combining memory banks from several grafting models could let a single recipient cover multiple specialized domains without extra training.
The method might let a small recipient model approach the performance of a much larger model by grafting memory from it.
Replacing exact suffix lookup with approximate or learned retrieval could raise coverage for rare or long contexts.

Load-bearing premise

Final-token hidden states produced by the grafting model on frequent local n-grams remain useful and transferable when retrieved via exact longest-match suffix lookup and adapted only by lightweight projections and gates inside the recipient model.
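One way such an adaptation layer could look is sketched below. The source only says "lightweight projections and gates", so the specific form here (a linear projection into the recipient width, a sigmoid gate over the concatenated hidden state and projected memory, and a residual add) is an assumption, as are the names `graft`, `w_proj`, and `w_gate`.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product on nested lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def graft(hidden: Vector, memory: Vector, w_proj: Matrix, w_gate: Matrix) -> Vector:
    """Project a retrieved memory into the recipient width, then gate it in.

    hidden: recipient hidden state, length d_r
    memory: grafted value from the frozen model, length d_g
    w_proj: d_r x d_g projection (trainable in the recipient)
    w_gate: d_r x (2 * d_r) gate weights over [hidden; projected memory]
    """
    projected = matvec(w_proj, memory)
    gate = [sigmoid(z) for z in matvec(w_gate, hidden + projected)]
    # Residual update: the gate decides how much of the memory to admit.
    return [h + g * p for h, g, p in zip(hidden, gate, projected)]


hidden = [1.0, -1.0]
memory = [0.5, 0.0, 2.0]
w_proj = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # picks components 0 and 2
w_gate = [[0.0] * 4, [0.0] * 4]              # zero weights give a gate of 0.5
print(graft(hidden, memory, w_proj, w_gate))
```

Under this form the only trainable pieces are `w_proj` and `w_gate`, while the memory values stay frozen, so the premise above amounts to asking whether two small matrices suffice to make another model's states usable.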

What would settle it

An ablation experiment on the same training data and recipient architecture where grafted memory retrieval is replaced by random vectors or disabled entirely, then checking whether benchmark gains over vanilla Engram disappear.
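The random-vector control described above could be set up along these lines. This is a hedged sketch: `randomize_bank` is a hypothetical helper, and matching each replacement's dimension and L2 norm to the original value is a design assumption meant to keep activation scale comparable, not something the source specifies.

```python
import math
import random
from typing import Dict, List, Tuple

NGram = Tuple[int, ...]
Vector = List[float]


def norm(v: Vector) -> float:
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def randomize_bank(bank: Dict[NGram, Vector], seed: int = 0) -> Dict[NGram, Vector]:
    """Replace each value with a Gaussian vector of the same dimension and L2 norm.

    Keys are untouched, so lookup coverage and every trainable parameter in the
    recipient stay identical; only the grafted content is destroyed.
    """
    rng = random.Random(seed)
    control: Dict[NGram, Vector] = {}
    for key in sorted(bank):  # fixed iteration order keeps the control reproducible
        value = bank[key]
        noise = [rng.gauss(0.0, 1.0) for _ in value]
        scale = norm(value) / (norm(noise) or 1.0)
        control[key] = [scale * x for x in noise]
    return control


bank = {(1, 2): [3.0, 4.0], (1, 2, 3): [1.0, 0.0]}
control = randomize_bank(bank)
print(sorted(control) == sorted(bank))  # same keys, different values
```

Training the recipient once with `bank` and once with `control`, everything else fixed, would separate the effect of the grafted content from the effect of the added lookup path and parameters.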

Original abstract

Scaling conditional memory offers a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, making memory scaling expensive and sometimes ineffective. We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, we run the grafting model offline, store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts. Since the grafting model is only run offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands external latent capacity with limited training and inference overhead. Experiments under matched recipient architectures and pre-training budgets show that Memory Grafting improves over both MoE and vanilla Engram baselines. In the 2.8B-scale setting, it improves the average benchmark score from 51.95 for MoE and 52.43 for vanilla Engram to 53.86. In the 0.92B-scale setting, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains. These results suggest that pretrained models can serve as reusable constructors of external latent memory, providing a practical step toward scaling future language models beyond trainable parameters alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Memory Grafting reuses final-token states from a larger model as n-gram memory for a smaller one and reports modest benchmark lifts over MoE and vanilla Engram, but the gains could stem from added capacity rather than the grafted content.

The main point is that this paper shows how to take hidden states computed offline by a big grafting model on frequent n-grams and feed them into a smaller recipient model through exact suffix lookup, with light adaptation layers and an Engram fallback. It claims this expands external memory capacity without much extra training or inference cost and delivers better average scores than the baselines under matched budgets. At 2.8B scale the average benchmark score moves from 52.43 with vanilla Engram to 53.86, and at 0.92B all grafting variants beat the controls, with the largest donor model helping most. That is the concrete result worth noting.

What is new is the specific offline grafting pipeline plus the exact longest-match lookup combined with the hash fallback; earlier work either learned memory tables from scratch or used different retrieval. The paper does a reasonable job laying out the procedure and running the head-to-head comparisons on the same recipient architectures and token budgets. Those controlled numbers give a practical data point for anyone thinking about external memory scaling.

The soft spot is the lack of evidence that the actual pretrained hidden states are doing the heavy lifting. The method assumes those states remain useful when pulled across models via suffix lookup and adapted only by projections and gates, yet there are no ablations with random vectors, recipient self-states, or alignment metrics to isolate the content from the extra parameters and coverage. Without error bars or fuller protocol details it is also harder to judge how stable the reported edge really is.

This paper is for people working on memory-augmented pretraining or efficient ways to leverage larger models as reusable components. A reader focused on practical scaling tricks would pick up usable implementation details from the grafting and fallback design. It deserves a serious referee because the core idea is straightforward, the comparisons are matched, and the overhead claims are testable even if the transfer story needs more checks.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Memory Grafting as a conditional memory scaling technique for language model pre-training. It computes final-token hidden states offline from a frozen larger grafting model on frequent n-grams, stores them as memory values, and enables retrieval in a smaller recipient model via exact longest-match suffix lookup. Retrieved states are adapted using lightweight projections and gates, with a hash-based Engram fallback for unmatched contexts. Experiments under matched architectures and budgets report benchmark gains over MoE and vanilla Engram baselines, e.g., average score rising from 51.95/52.43 to 53.86 at 2.8B scale and consistent improvements at 0.92B scale with stronger grafting models.

Significance. If the gains are attributable to the specific transferable representations from the grafting model rather than added parameters or coverage alone, the approach provides an efficient route to external latent memory that reuses pretrained models as constructors, reducing the cost of learning large memory tables from scratch and supporting capacity scaling beyond trainable parameters.

major comments (2)

[§4] §4 (Experiments): The reported benchmark improvements (e.g., 51.95 for MoE and 52.43 for vanilla Engram to 53.86) are presented without error bars, multiple random seeds, or statistical tests, leaving open the possibility that observed differences fall within training variance and weakening the empirical support for the central scaling claim.
[§3 and §4.2] §3 (Method) and §4.2 (Ablations): The key assumption that grafting-model final-token hidden states remain useful under exact suffix lookup and lightweight adaptation is load-bearing, yet no controls (e.g., random vectors, recipient self-states, or alignment metrics) or ablations isolating grafted content from the added projection/gate parameters and fallback mechanism are described; without these, gains could arise from capacity rather than the pretrained states, undermining the 'reusable constructors of external latent memory' argument.

minor comments (2)

[§3] The description of expected O(1) lookup complexity with respect to memory-bank size would benefit from an explicit statement of the hash-table implementation and worst-case behavior.
[§2] Notation for the grafting model variants (e.g., Qwen3.5-35B-A3B) and recipient scales should be introduced with a table for clarity when first mentioned.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate revisions to the manuscript where appropriate.

Referee: [§4] §4 (Experiments): The reported benchmark improvements (e.g., 51.95 for MoE and 52.43 for vanilla Engram to 53.86) are presented without error bars, multiple random seeds, or statistical tests, leaving open the possibility that observed differences fall within training variance and weakening the empirical support for the central scaling claim.

Authors: We agree that reporting results from single training runs without error bars or statistical tests is a limitation. Pre-training at the 2.8B scale under matched budgets is computationally expensive, which limited us to one run per configuration. That said, the gains appear consistently across two model scales (0.92B and 2.8B) and across grafting models of different strengths, with larger improvements from stronger grafting models. In the revised manuscript we will add a paragraph in §4 explicitly noting the single-run limitation and highlighting the cross-scale and cross-grafting-model consistency as supporting evidence for the scaling claim.

revision: yes

Referee: [§3 and §4.2] §3 (Method) and §4.2 (Ablations): The key assumption that grafting-model final-token hidden states remain useful under exact suffix lookup and lightweight adaptation is load-bearing, yet no controls (e.g., random vectors, recipient self-states, or alignment metrics) or ablations isolating grafted content from the added projection/gate parameters and fallback mechanism are described; without these, gains could arise from capacity rather than the pretrained states, undermining the 'reusable constructors of external latent memory' argument.

Authors: We partially addressed the concern by comparing against the vanilla Engram baseline, which uses identical projection/gate parameters and the same hash-based fallback but learns memory values from scratch rather than grafting pretrained states. We also report that stronger grafting models (e.g., Qwen3.5-35B-A3B) produce larger gains than weaker ones under fixed recipient architecture and budget, suggesting the benefit is not solely from added capacity. To more directly isolate the grafted content, we will add an ablation that replaces grafted hidden states with random vectors while keeping all other components fixed; this will appear in the revised §4.2.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark gains over matched baselines

The paper describes an engineering method (offline grafting of final-token hidden states from a larger model, exact suffix lookup, lightweight adaptation, and Engram fallback) and validates it via controlled pre-training experiments that report average benchmark improvements (51.95/52.43 → 53.86 at 2.8B scale). It presents no derivation, equation, or first-principles claim that reduces by construction to fitted parameters, self-citations, or renamed inputs. The results rest on external comparisons to MoE and vanilla Engram under matched architectures and budgets, so the claims are checked against observable performance metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method implicitly relies on the transferability of hidden states and the effectiveness of lightweight adaptation, but these are not formalized here.

pith-pipeline@v0.9.0 · 5823 in / 1328 out tokens · 27798 ms · 2026-05-21T05:46:05.346744+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory... Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts.
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 15 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conferenceon artificialintelligence, volume 34, pages 7432–7439, 2020

work page 2020
[3]

Exploring influencers’ and users’ experiences in douyin’s virtual reality live-streaming

Rongyi Chen, Jingjia Xiao, Zilu Wang, Menghan Yin, Xianzhe Fan, Zihe Ran, and Qing Xiao. Exploring influencers’ and users’ experiences in douyin’s virtual reality live-streaming. InProceedings of the 30th ACM Symposium on Virtual Reality Software and Technology, pages 1–2, 2024

work page 2024
[4]

Whoever started the interference should end it: Guiding data-free model merging via task vectors.arXiv preprint arXiv:2503.08099, 2025

Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, and Chun Yuan. Whoever started the interference should end it: Guiding data-free model merging via task vectors.arXiv preprint arXiv:2503.08099, 2025

work page arXiv 2025
[5]

Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, et al. Conditional memory via scalable lookup: A new axis of sparsity for large language models. arXiv preprint arXiv:2601.07372, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

Boolq: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), ...

work page 2019
[7]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXivpreprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models

Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1280–1297, 2024

work page 2024
[9]

Reliability of cka as a similarity measure in deep learning.arXiv preprint arXiv:2210.16156, 2022

MohammadRezaDavari,StefanHoroi,AmineNatik,GuillaumeLajoie,GuyWolf,andEugeneBelilovsky. Reliability of cka as a similarity measure in deep learning.arXiv preprint arXiv:2210.16156, 2022

work page arXiv 2022
[10]

Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024

DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024

work page 2024
[11]

Sigmoid-weighted linear units for neural network function approxi- mation in reinforcement learning.Neural networks, 107:3–11, 2018

Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approxi- mation in reinforcement learning.Neural networks, 107:3–11, 2018

work page 2018
[12]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

work page 2022
[13]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

work page 2024
[14]

Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, Haofen Wang, et al. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2(1):32, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Team Glm, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Distilled pretraining: A modern lens of data, in-context learning and test-time scaling.arXiv preprint arXiv:2509.01649, 2025

Sachin Goyal, David Lopez-Paz, and Kartik Ahuja. Distilled pretraining: A modern lens of data, in-context learning and test-time scaling.arXiv preprint arXiv:2509.01649, 2025

work page arXiv 2025
[17]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Enhancing logits distillation with plug&play kendall’sτ ranking loss, 2025

Yuchen Guan, Runxi Cheng, Kang Liu, and Chun Yuan. Enhancing logits distillation with plug&play kendall’sτ ranking loss, 2025

work page 2025
[19]

Retrieval augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. InInternational conference on machine learning, pages 3929–3938. PMLR, 2020

work page 2020
[20]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. 11 arXiv preprint arXiv:2203.15556, 10, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

Ultra-sparse memory network

Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, and Xun Zhou. Ultra-sparse memory network. arXiv preprint arXiv:2411.12364, 2024

work page arXiv 2024
[22]

Open-rag: Enhanced retrieval augmented reasoning with open-source large language models

Shayekh Bin Islam, Md Asib Rahman, KSM Tozammel Hossain, Enamul Hoque, Shafiq Joty, and Md Rizwan Parvez. Open-rag: Enhanced retrieval augmented reasoning with open-source large language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 14231–14244, 2024

work page 2024
[23]

Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models.arXiv preprint arXiv:2310.09949, 2023

Wenqi Jiang, Marco Zeller, Roger Waleffe, Torsten Hoefler, and Gustavo Alonso. Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models.arXiv preprint arXiv:2310.09949, 2023

work page arXiv 2023
[24]

Mixture of lookup experts

Shibo Jie, Yehui Tang, Kai Han, Yitong Li, Duyu Tang, Zhi-Hong Deng, and Yunhe Wang. Mixture of lookup experts. arXiv preprint arXiv:2503.15798, 2025

work page arXiv 2025
[25]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Rad- ford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[26]

Improved backing-off for m-gram language modeling

Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In1995 international conference on acoustics, speech, and signal processing, volume 1, pages 181–184. IEEE, 1995

work page 1995
[27]

Race: Large-scale reading comprehension dataset from examinations

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations. InProceedings of the 2017 conference on empirical methods in natural language processing, pages 785–794, 2017

work page 2017
[28]

Large memory layers with product keys.Advances in Neural InformationProcessing Systems, 32, 2019

Guillaume Lample, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Large memory layers with product keys.Advances in Neural InformationProcessing Systems, 32, 2019

work page 2019
[29]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural informationprocessing systems, 33:9459–9474, 2020

work page 2020
[30]

Rella: Retrieval-enhanced large language models for lifelong sequential behavior comprehension in recommendation

Jianghao Lin, Rong Shan, Chenxu Zhu, Kounianhua Du, Bo Chen, Shigang Quan, Ruiming Tang, Yong Yu, and Weinan Zhang. Rella: Retrieval-enhanced large language models for lifelong sequential behavior comprehension in recommendation. InProceedings of the ACM WebConference 2024, pages 3497–3508, 2024

work page 2024
[31]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Infini-gram: Scalingunbounded n-gram language models to a trillion tokens.arXiv preprint arXiv:2401.17377, 2024

JiachengLiu, SewonMin, LukeZettlemoyer, YejinChoi, andHannanehHajishirzi. Infini-gram: Scalingunbounded n-gram language models to a trillion tokens.arXiv preprint arXiv:2401.17377, 2024

work page arXiv 2024
[33]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[34]

The lambada dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc-Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. InProceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 1525–...

work page 2016
[35]

Pre-training distillation for large language models: A design space exploration

Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, and Juanzi Li. Pre-training distillation for large language models: A design space exploration. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers), pages 3603–3618, 2025

work page 2025
[36]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026

work page 2026
[37]

How users who are blind or low vision play mobile games: Perceptions, challenges, and strategies

Zihe Ran, Xiyu Li, Qing Xiao, Xianzhe Fan, Franklin Mingzhe Li, Yanyun Wang, and Zhicong Lu. How users who are blind or low vision play mobile games: Perceptions, challenges, and strategies. InProceedings of the 2025 CHI Conference on Human Factorsin Computing Systems, pages 1–18, 2025

work page 2025
[38]

Understanding how visually impairedplayerssocializeinmobilegames

Zihe Ran, Xiyu Li, Qing Xiao, Yanyun Wang, Franklin Mingzhe Li, and Zhicong Lu. Understanding how visually impairedplayerssocializeinmobilegames. In Proceedingsofthe27thInternationalACMSIGACCESSConference on Computers and Accessibility, pages 1–16, 2025

work page 2025
[39]

Stem: Scaling transformers with embedding modules

Ranajoy Sadhukhan, Sheng Cao, Harry Dong, Changsheng Zhao, Attiano Purpura-Pontoniere, Yuandong Tian, Zechun Liu, and Beidi Chen. Stem: Scaling transformers with embedding modules. arXiv preprint arXiv:2601.10639, 2026

work page arXiv 2026
[40]

Winogrande: An adversarial winograd schema challenge at scale.Communicationsof the ACM, 64(9):99–106, 2021

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communicationsof the ACM, 64(9):99–106, 2021

work page 2021
[41]

A mathematicaltheory of communication.TheBellsystemtechnicaljournal, 27(3):379– 423, 1948

Claude Elwood Shannon. A mathematicaltheory of communication.TheBellsystemtechnicaljournal, 27(3):379– 423, 1948

work page 1948
[42]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Out- 12 rageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[43]

Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset

Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset. InProceedings ofthe 63rd AnnualMeeting ofthe Association for Computational Linguistics (Volume1: Long Papers), pages 2459–2475, 2025

work page 2025
[44]

Gemma 3n

Gemma Team. Gemma 3n. 2025

work page 2025
[45]

Gemma Team. Gemma 4. 2026

work page 2026
[46]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Attention is all you need.Advances in neural informationprocessing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural informationprocessing systems, 30, 2017

work page 2017
[49] Yongxian Wei, Runxi Cheng, Weike Jin, Enneng Yang, Li Shen, Lu Hou, Sinan Du, Chun Yuan, Xiaochun Cao, and Dacheng Tao. Unifying multimodal large language model capabilities and modalities via model merging. arXiv preprint arXiv:2505.19892, 2025.
[50] Yongxian Wei, Anke Tang, Li Shen, Zixuan Hu, Chun Yuan, and Xiaochun Cao. Modeling multi-task model merging as adaptive projective gradient descent. arXiv preprint arXiv:2501.01230, 2025.
[51] Yongxian Wei, Yilin Zhao, Li Shen, Xinrui Chen, Runxi Cheng, Sinan Du, Hao Yu, Xiaohan Wang, Gang Liu, Jiahong Yan, et al. Learning to pose problems: Reasoning-driven and solver-adaptive data synthesis for large reasoning models. arXiv preprint arXiv:2511.09907, 2025.
[52] Feng Xiong, Runxi Cheng, Wang Chen, Zhanqiu Zhang, Yiwen Guo, Chun Yuan, and Ruifeng Xu. Multi-task model merging via adaptive weight disentanglement. arXiv preprint arXiv:2411.18729, 2024.
[53] Feng Xiong, Hongling Xu, Yifei Wang, Runxi Cheng, Yong Wang, and Xiangxiang Chu. Hs-star: Hierarchical sampling for self-taught reasoners via difficulty estimation and budget reallocation. arXiv preprint arXiv:2505.19866, 2025.
[54] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 4791–4800, 2019.
[55] Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471, 2025.
[56] Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019.
[57] Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, and Zhihao Jia. Accelerating retrieval-augmented language model serving with speculation. arXiv preprint arXiv:2401.14021, 2024.
