Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs

Hantao Huang; Hao Zeng; Ian En-Hsu Yen; Jinyuan Shi; Minghui Yu; Ruixuan Huang; Yifan Huang; Ziyi Guan

arxiv: 2606.10722 · v1 · pith:QH7BB27Anew · submitted 2026-06-09 · 💻 cs.CL

Pith reviewed 2026-06-27 13:02 UTC · model grok-4.3

keywords: continual training · sparse LLMs · channel sparsity · FFN routing · dense-to-sparse upcycling · predictor-gated sparsity · long-context repair

The pith

Continual training with a predictor-gated bank-wise sparse FFN upcycles a dense Qwen2.5-8B into a 4x channel-sparse LLM at 32K context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to turn an existing dense large language model into a channel-sparse one by continuing its training rather than starting over. A low-rank predictor learns to route each token through only 16 of every 64 FFN channels using a bank-wise top-k rule, producing 4x sparsity in the intermediate activations. The predictor sits on the main language-modeling path and is optimized during continual training rather than fitted after the fact. A reader would care because this lets practitioners reuse dense checkpoints to build models suited to hardware that favors sparse execution, without a full training run from random weights. The paper also reports a single-layer repair for a layer-local long-context failure that appears during this process.

Core claim

Placing a low-rank predictor on the main training path to generate per-token, per-layer FFN routing logits, then enforcing bank-wise top-k selection of 16 channels out of 64 during the 32K continual training stage, converts the dense backbone into a hardware-oriented sparse model while maintaining downstream performance.
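The core claim can be written out as a small set of equations. The following is one possible formalization, not the paper's own notation: the symbols $x$, $h$, $z$, $m$, $A$, $B$, the rank $r$, and the choice to feed the predictor the same input $x$ as the FFN are all assumptions made for illustration.

```latex
\begin{aligned}
h &= \mathrm{SiLU}(W_{\text{gate}}\,x) \odot (W_{\text{up}}\,x) \in \mathbb{R}^{d_{\text{ff}}} && \text{(SwiGLU intermediate)}\\
z &= B\,A\,x, \quad A \in \mathbb{R}^{r \times d},\; B \in \mathbb{R}^{d_{\text{ff}} \times r},\; r \ll d && \text{(low-rank routing logits)}\\
m_i &= \mathbb{1}\!\left[\, z_i \text{ is among the 16 largest logits in its 64-channel bank} \,\right] && \text{(bank-wise top-}k\text{)}\\
y &= W_{\text{down}}\,(m \odot h) && \text{(sparse FFN output)}
\end{aligned}
```

Under this reading, exactly $d_{\text{ff}}/4$ entries of $m$ are nonzero for every token and layer, which is the stated 4x sparsity.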

What carries the argument

The predictor-gated bank-wise sparse SwiGLU FFN, in which a low-rank module produces routing logits and a fixed bank-wise top-k rule keeps exactly 16 channels active per 64-channel bank for each token and layer.
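A minimal sketch of the bank-wise top-k rule in plain Python, assuming banks are contiguous groups of 64 channels and that ties are broken toward the lower channel index (neither detail is stated in the source). The function names `bankwise_topk_mask` and `gated_intermediate` are hypothetical.

```python
def bankwise_topk_mask(logits, bank_size=64, k=16):
    """Return a 0/1 mask keeping the k largest logits in each contiguous bank."""
    if len(logits) % bank_size != 0:
        raise ValueError("channel count must be a multiple of bank_size")
    mask = [0] * len(logits)
    for start in range(0, len(logits), bank_size):
        bank = range(start, start + bank_size)
        # Rank channels in this bank by logit, largest first.
        # Assumption: ties go to the lower channel index.
        keep = sorted(bank, key=lambda i: (-logits[i], i))[:k]
        for i in keep:
            mask[i] = 1
    return mask


def gated_intermediate(gate_act, up_act, mask):
    """Elementwise SwiGLU intermediate (gate * up) with unselected channels zeroed."""
    return [g * u * m for g, u, m in zip(gate_act, up_act, mask)]


# Toy usage: two banks of 64 channels with synthetic routing logits.
logits = [((i * 37) % 101) / 10.0 for i in range(128)]
mask = bankwise_topk_mask(logits)
print(sum(mask[:64]), sum(mask[64:]))  # active channels per bank
```

Because the count is fixed per bank rather than per layer, every 64-channel group carries the same number of active channels, which is the property that makes the pattern regular enough for hardware to exploit.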

If this is right

The resulting model activates one quarter of the FFN intermediate channels per token (4x sparsity) while remaining trainable from a dense checkpoint.
A single-layer repair algorithm substantially improves the affected length range of the layer-local RULER-CWE long-context failure.
The training recipe includes explicit lessons on predictor placement and sparsity scheduling that apply to similar upcycling runs.
The sparse architecture is directly compatible with hardware accelerators that exploit channel-level sparsity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same predictor-plus-bank-wise rule might be applied to other linear layers beyond the FFN without changing the overall training loop.
Inference latency on sparsity-aware accelerators could be measured directly to quantify the hardware benefit beyond the reported training metrics.
Extending the bank size or predictor rank would test whether the current 64-to-16 ratio is optimal or merely convenient.
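To make these extensions concrete, the following sketch computes the active-channel fraction for a given bank size and k, and the weight count of a two-factor low-rank predictor. The function names and the values in the sweep are hypothetical and are not taken from the paper.

```python
def active_fraction(bank_size: int, k: int) -> float:
    """Fraction of FFN intermediate channels kept by a bank-wise top-k rule."""
    if not 0 < k <= bank_size:
        raise ValueError("k must satisfy 0 < k <= bank_size")
    return k / bank_size


def low_rank_predictor_params(d_model: int, d_ff: int, rank: int) -> int:
    """Weights in a two-factor predictor d_model -> rank -> d_ff (biases ignored)."""
    return rank * (d_model + d_ff)


# Hypothetical sweep: the same 4x sparsity reached with different bank sizes.
for bank_size, k in [(32, 8), (64, 16), (128, 32)]:
    print(bank_size, k, active_fraction(bank_size, k))
```

Equal fractions do not imply equal behavior: a larger bank lets the router concentrate its active channels more unevenly within the bank, while a smaller bank forces a more uniform spread, so such a sweep would test the routing constraint rather than the sparsity level.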

Load-bearing premise

Jointly optimizing the routing predictor with the base model during continual training will discover channel selections that keep downstream performance close to the dense baseline at 4x sparsity.

What would settle it

If the final sparse model scores substantially lower than a dense model continued under identical 32K training conditions on the same benchmarks, the joint-optimization premise does not hold.

Figures

Figures reproduced from arXiv: 2606.10722 by Hantao Huang, Hao Zeng, Ian En-Hsu Yen, Jinyuan Shi, Minghui Yu, Ruixuan Huang, Yifan Huang, Ziyi Guan.

**Figure 2.** Training loss around the sparse transition in the 32K stage. Colored bands denote predictor ...

**Figure 3.** Channel-balance ablations during 32K continual training.

**Figure 4.** RULER-CWE average recall@10 as context length increases.

**Figure 5.** Continual-training ablation comparing predictor before top- ...

**Figure 6.** Layer-7 final-token repair shows a trade-off between long-CWE recovery and MMLU.

Original abstract

We study dense-to-sparse continual training as a way to construct channel-sparse large language models from dense checkpoints. Starting from a Qwen2.5-8B dense backbone, we continue training at 32K context and introduce a predictor-gated sparse SwiGLU FFN in the 32K stage. For each token and layer, we use a low-rank predictor to produce FFN-channel routing logits. We then apply a bank-wise top-k rule to retain 16 channels in every 64-channel bank, yielding 4x sparsity in the FFN intermediate activation. Unlike post-hoc sparse inference methods, the routing module is placed on the main language modeling path and optimized during continual training, enabling the dense model to be upcycled into a hardware-oriented sparse model. We report the architecture, training recipe, benchmark performance, and training lessons. We also identify a layer-local long-context failure mode on RULER-CWE and propose a single-layer repair algorithm that substantially improves the affected length range.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a concrete recipe for upcycling a dense 8B model to a 4x bank-wise sparse FFN via a jointly trained low-rank predictor during 32K continual training, plus a single-layer long-context fix.

The main new element is the training recipe that adds a low-rank predictor to the main path, produces routing logits per token and layer, then applies bank-wise top-k (16 out of 64 channels) to reach 4x sparsity in the SwiGLU FFN while continuing training from Qwen2.5-8B at 32K context. They also flag a layer-local failure on RULER-CWE and give a single-layer repair that improves the affected range.

This approach is practical because it starts from an existing dense checkpoint instead of training sparse from scratch, and the joint optimization lets the model adapt to the sparsity pattern rather than applying it after the fact. The bank-wise rule is hardware-friendly, which matches real deployment needs.

The description of the architecture and the training stage is straightforward. Reporting the full recipe, benchmarks, and lessons learned is the right move for this kind of work.

The soft spot is that the abstract supplies no numbers, ablations, or error bars, so it is impossible to tell whether the learned routing actually carries the performance or whether the continual training alone would have been enough. The hard top-k selection raises a real question about gradient flow to the predictor; the text does not mention a straight-through estimator or auxiliary loss, which leaves that link unverified from what is shown.

This is for readers who build or deploy efficient LLMs and want a worked example of dense-to-sparse upcycling. It deserves peer review because the recipe is specific and the long-context repair is a useful observation, even if the results need checking.

Referee Report

1 major / 2 minor

Summary. The paper claims that dense-to-sparse continual training can upcycle a Qwen2.5-8B checkpoint into a hardware-oriented 4x channel-sparse LLM. Starting from the dense backbone, the authors continue training at 32K context while replacing the SwiGLU FFN with a predictor-gated version: a low-rank predictor produces per-token, per-layer routing logits, a bank-wise top-k rule keeps 16 of 64 channels per bank, and the predictor is placed on the main LM path and jointly optimized. They also report a layer-local long-context failure mode on RULER-CWE together with a single-layer repair algorithm.

Significance. If the joint optimization of the low-rank predictor with the bank-wise top-k selection succeeds in preserving downstream performance, the recipe would supply a concrete, training-based route from dense checkpoints to sparse models that is more hardware-friendly than post-hoc pruning. The explicit identification and repair of the RULER-CWE length-range failure is a secondary but useful empirical contribution.

major comments (1)

[Abstract / §3 (Training Recipe)] The central claim requires that joint optimization of the low-rank predictor with the discrete bank-wise top-k (16/64) produces effective routing. No relaxation, straight-through estimator, or auxiliary loss is described for propagating gradients through the hard selection; without such a mechanism the gradients to the predictor are undefined or noisy, undermining the assertion that the learned routing itself is responsible for retained performance.
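The concern can be made concrete with a short calculation. Writing $z$ for the routing logits and $m(z)$ for the hard bank-wise top-k mask (both symbols are illustrative, not the paper's notation), the mask is piecewise constant in $z$, so exact differentiation gives the predictor no signal:

```latex
m_i(z) = \mathbb{1}\!\left[\, z_i \in \operatorname{top\text{-}16}\big(z_{\text{bank}(i)}\big) \right],
\qquad
\frac{\partial m_i}{\partial z_j} = 0 \ \text{wherever the top-16 set is locally unchanged},
\qquad
\Rightarrow\ \frac{\partial \mathcal{L}}{\partial z}
  = \sum_i \frac{\partial \mathcal{L}}{\partial m_i}\,\frac{\partial m_i}{\partial z} = 0
  \ \text{almost everywhere}.
```

This holds only if the logits enter the forward pass through the mask alone. If they also scale the retained activations, a gradient path exists without any estimator; the text reviewed here does not say which is the case.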

minor comments (2)

The abstract states that benchmark performance and training lessons are reported, yet no tables, numbers, or ablation controls appear in the provided text; these must be added with error bars and dense/sparse comparisons.
Notation for the bank-wise top-k operation and the low-rank predictor dimensions should be formalized with equations rather than prose only.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the critical issue of gradient flow through the discrete bank-wise top-k selection. This is a valid point regarding the training recipe description. We address it directly below and will revise the manuscript accordingly.

Referee: [Abstract / §3 (Training Recipe)] The central claim requires that joint optimization of the low-rank predictor with the discrete bank-wise top-k (16/64) produces effective routing. No relaxation, straight-through estimator, or auxiliary loss is described for propagating gradients through the hard selection; without such a mechanism the gradients to the predictor are undefined or noisy, undermining the assertion that the learned routing itself is responsible for retained performance.

Authors: We agree that the current manuscript text in §3 does not explicitly describe the mechanism for back-propagating through the hard bank-wise top-k (16/64) selection. In the implementation, gradients were propagated using the straight-through estimator (STE), with the forward pass applying the discrete top-k and the backward pass treating the selection as the identity function. No auxiliary loss was used. This detail was omitted from the description of the training recipe. We will revise §3 (and the abstract if space permits) to state that the low-rank predictor is jointly optimized via STE through the discrete selection, thereby clarifying how the routing logits receive gradients and supporting the claim that the learned routing contributes to performance retention.

Revision: yes
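The STE described in the rebuttal can be sketched for a single bank as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the mask multiplies the intermediate activations elementwise and that the backward pass copies the gradient with respect to the mask onto the logits unchanged. The function names are hypothetical, and gradients are written out by hand so the example needs no autograd library.

```python
def ste_forward(hidden, logits, k):
    """Forward pass: hard top-k mask from the logits, applied to the activations."""
    order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    keep = set(order[:k])
    mask = [1.0 if i in keep else 0.0 for i in range(len(logits))]
    out = [h * m for h, m in zip(hidden, mask)]
    return out, mask


def ste_backward(hidden, mask, grad_out):
    """Backward pass with a straight-through estimator on the selection.

    out_i = hidden_i * mask_i, so the exact partials are
    d out_i / d hidden_i = mask_i and d out_i / d mask_i = hidden_i.
    Assumption: STE replaces d mask_i / d logit_i (zero almost everywhere)
    with 1, so the mask gradient is passed to the logits unchanged.
    """
    grad_hidden = [g * m for g, m in zip(grad_out, mask)]
    grad_mask = [g * h for g, h in zip(grad_out, hidden)]
    grad_logits = list(grad_mask)  # identity surrogate
    return grad_hidden, grad_logits


# Toy usage: one bank of 4 channels, keep 2.
hidden = [2.0, -1.0, 0.5, 3.0]
logits = [0.1, 0.9, -0.3, 0.4]
out, mask = ste_forward(hidden, logits, k=2)
grad_hidden, grad_logits = ste_backward(hidden, mask, [1.0, 1.0, 1.0, 1.0])
print(mask, grad_logits)
```

Under this identity surrogate, channels dropped in the forward pass still receive a logit gradient, which is what would allow a currently inactive channel to become active later in training.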

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description outline an empirical continual-training recipe that introduces a low-rank predictor for routing logits followed by bank-wise top-k selection during the 32K stage. No equations, fitted parameters, or derivations are shown that reduce by construction to their own inputs (e.g., no self-definitional scaling, no prediction of a quantity used in its own fit). No self-citations, uniqueness theorems, or ansatzes are invoked in the given text. The central claim concerns the outcome of joint optimization on downstream performance, which is an empirical assertion rather than a closed mathematical reduction. This matches the default expectation for non-circular papers; the method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies insufficient detail to enumerate free parameters, axioms, or invented entities; the central claim implicitly rests on the unexamined premise that a jointly trained low-rank predictor can discover useful routing without explicit regularization details being stated.

Appendix A Predictor Top-k Ablation

This section records the diagnostic ablation that motivates the predictor-before-top-k sparse-routing order. ...

[1] [1]

Gomez, Lukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V . N. Vishwanathan, and Roman Garnett, editors,Advances in Neural Information Processing Systems 30: Annual Conference...

2017

[2] [2]

GLU variants improve transformer.ArXiv preprint, abs/2002.05202, 2020

Noam Shazeer. GLU variants improve transformer.ArXiv preprint, abs/2002.05202, 2020. URLhttps://arxiv.org/abs/2002.05202

Pith/arXiv arXiv 2002

[3] [3]

Deja vu: Contextual sparsity for efficient llms at inference time

Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shri- vastava, Ce Zhang, Yuandong Tian, Christopher Ré, and Beidi Chen. Deja vu: Contextual sparsity for efficient llms at inference time. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,International Confer- e...

2023

[4] [4]

Gshard: Scaling giant models with con- ditional computation and automatic sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with con- ditional computation and automatic sharding. In9th International Conference on Learning 9 Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URLhtt...

2021

[5] [5]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23 (120):1–39, 2022

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23 (120):1–39, 2022

2022

[6] [6]

AbouElhamayed, Jordan Dotzel, Zhiru Zhang, Alexander M

Yash Akhauri, Ahmed F. AbouElhamayed, Jordan Dotzel, Zhiru Zhang, Alexander M. Rush, Safeen Huda, and Mohamed S. Abdelfattah. Shadowllm: Predictor-based contextual sparsity for large language models.ArXiv preprint, abs/2406.16635, 2024. URL https://arxiv.org/ abs/2406.16635

arXiv 2024

[7] [7]

Sirius: Contextual sparsity with correction for efficient llms.ArXiv preprint, abs/2409.03856, 2024

Yang Zhou, Zhuoming Chen, Zhaozhuo Xu, Victoria Lin, and Beidi Chen. Sirius: Contextual sparsity with correction for efficient llms.ArXiv preprint, abs/2409.03856, 2024. URL https: //arxiv.org/abs/2409.03856

arXiv 2024

[8] [8]

Mixture-of-channels: Exploiting sparse ffns for efficient llms pre-training and inference.ArXiv preprint, abs/2511.09323, 2025

Tong Wu, Yutong He, Bin Wang, and Kun Yuan. Mixture-of-channels: Exploiting sparse ffns for efficient llms pre-training and inference.ArXiv preprint, abs/2511.09323, 2025. URL https://arxiv.org/abs/2511.09323

arXiv 2025

[9] [9]

Qwen2.5 technical report.ArXiv preprint, abs/2412.15115, 2024

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report.ArXiv preprint, abs/2412.15115, 2024. URLhttps://arxiv.org/abs/2412.15115

Pith/arXiv arXiv 2024

[10] [10]

Sparse upcycling: Training mixture-of-experts from dense checkpoints

Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URLht...

2023

[11] [11]

Ruler: What’s the real context size of your long-context language models? In First Conference on Language Modeling, 2024

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models? In First Conference on Language Modeling, 2024

2024

[12] [12]

Willcock, Suvinay Subramanian, Felix Chern, Alek Andreev, Shreya Pathak, Felix Yu, Prateek Jain, David E

Chong You, Kan Wu, Zhipeng Jia, Lin Chen, Srinadh Bhojanapalli, Jiaxian Guo, Utku Evci, Jan Wassenberg, Praneeth Netrapalli, Jeremiah J. Willcock, Suvinay Subramanian, Felix Chern, Alek Andreev, Shreya Pathak, Felix Yu, Prateek Jain, David E. Culler, Henry M. Levy, and Sanjiv Kumar. Spark transformer: Reactivating sparsity in ffn and attention.ArXiv prepr...

arXiv 2025

[13] [13]

MoEfication: Transformer feed-forward layers are mixtures of experts

Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. MoEfication: Transformer feed-forward layers are mixtures of experts. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors,Findings of the Association for Computational Linguistics: ACL 2022, pages 877–890, Dublin, Ireland, 2022. Association for Computational Ling...

2022

[14] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. ArXiv preprint, abs/2411.04368, 2024. URL https://arxiv.org/abs/2411.04368

[15] Yancheng He, Shilong Li, Jiaheng Liu, Yingshui Tan, Weixun Wang, Hui Huang, Xingyuan Bu, Hangyu Guo, Chengwei Hu, Boren Zheng, Zhuoran Lin, Xuepeng Liu, Dekai Sun, Shirong Lin, Zhicheng Zheng, Xiaoyong Zhu, Wenbo Su, and Bo Zheng. Chinese SimpleQA: A Chinese factuality evaluation for large language models. ArXiv preprint, abs/2411.07140, 2024. URL https://...

[16] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. ArXiv preprint, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732

[17] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Ne... (2023)

[18] Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q. Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. MultiPL-E: A scalable and extensible approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 2023.

[19] Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, and Ramesh Nallapati. Multi-lingual evaluation of cod... (2023)

[20] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. ArXiv preprint, abs/2403.07974, 2024. URL https://arxiv.org/abs/2403.07974

[21] Kaijing Ma, Xinrun Du, Yunran Wang, Haoran Zhang, Zhoufutu Wen, Xingwei Qu, Jian Yang, Jiaheng Liu, Minghao Liu, Xiang Yue, Wenhao Huang, and Ge Zhang. KOR-Bench: Benchmarking language models on knowledge-orthogonal reasoning tasks. ArXiv preprint, abs/2410.06526, 2024. URL https://arxiv.org/abs/2410.06526

[22] Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. ZebraLogic: On the scaling limits of LLMs for logical reasoning. ArXiv preprint, abs/2502.01100, 2025. URL https://arxiv.org/abs/2502.01100

[23] Ippei Fujisawa, Sensho Nobe, Hiroki Seto, Rina Onda, Yoshiaki Uchida, Hiroki Ikoma, Pei-Chun Chien, and Ryota Kanai. ProcBench: Benchmark for multi-step reasoning and following procedure. ArXiv preprint, abs/2410.03117, 2024. URL https://arxiv.org/abs/2410.03117

[24] François Chollet. On the measure of intelligence. ArXiv preprint, abs/1911.01547, 2019. URL https://arxiv.org/abs/1911.01547

[25] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Hu...

[26] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, and Jason Wei. Challenging BIG-bench tasks and whether chain-of-thought can solve them. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics:... doi:10.18653/v1/2023.findings-acl.824, 2023.

[27] Mathematical Association of America. American invitational mathematics examination (AIME). https://maa.org/student-programs/amc/, 2025.

[28] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems, volume 34, 2021.

[29] Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. LiveBench: A challenging, contamination-free LLM benchmark. ArXiv preprint, abs/2406.19314, 2024. URL https://arxiv.org/abs/2406.19314

[30] M-A-P Team et al. SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines. ArXiv preprint, abs/2502.14739, 2025. URL https://arxiv.org/abs/2502.14739

[31] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024, pages 2299–2314, Mexico City, Mexico, 2024. Ass...

[32] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela ... (2024)

[33] Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editor... (2023)

[34] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

[35] Greg Kamradt. LLMTest_NeedleInAHaystack: Doing simple retrieval from LLM models at various context lengths to measure accuracy. https://github.com/gkamradt/LLMTest_NeedleInAHaystack, 2023. GitHub repository.

Appendix A Predictor Top-k Ablation

This section records the diagnostic ablation that motivates the predictor-before-top-k sparse-routing order. ...