pith. machine review for the scientific record.

arxiv: 2604.03044 · v2 · submitted 2026-04-03 · 💻 cs.CL · cs.AI

Recognition: no theorem link

JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

Aichen Cai, Anmeng Zhang, Anyu Li, Bohua Cai, Bo Zhang, Changjian Jiang, Changkai Lu, Chang Li, Chaocai Liang, Chao Xue, Cheng Zhang, Dongkai Liu, Fei Wang, Guoqiang Huang, Haijian Ke, Han Lin, Hao Wang, Jiacheng Zhang, Jialong Shi, Jifeng Zhu, Ji Miao, Jingjing Qian, Junhui Luo, Junwu Xiong, Lam So, Liang Huang, Ming Ke, Mingyang Li, Panfeng Shi, Peng Hao, Qian Lai, Qiaoqiao Yuan, Qingyu Yin, Qiong Cao, Qi Wang, Qixiang Wang, Rongcheng Bian, Rongduo Han, Shaoqiang Zheng, Shi Hu, Shijie Ren, Shijin Zhang, Shi Suo, Shiying Fan, Shuai Xie, Tianyi Zhang, Wei Liu, Wentao Tan, Xianghan Meng, Xiaodong He, Xing Pan, Xiran Wang, Xuyang Peng, Yang Liu, Yangyang Duan, Yanxu Chen, Ya Zhang, Yicheng Gong, Yidan Huang, Yifei Liu, Yinhao Bai, Yongqiang Liu, Yuesong Zhang, Yuqi Zhang, Zerui Xie, Zhenfang Wang, Zhennan Shen, Zheyuan Liu, Zhuwei Zeng

Pith reviewed 2026-05-13 19:23 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Mixture-of-Experts · Token Efficiency · Reinforcement Learning · Sparse Activation · Language Model · Multi-Token Prediction · Policy Optimization

The pith

JoyAI-LLM Flash is a 48B-parameter MoE model that activates only 2.7B parameters per forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces JoyAI-LLM Flash as an efficient Mixture-of-Experts language model in the sub-50B regime. It is pretrained on 20 trillion tokens and refined with supervised fine-tuning, direct preference optimization, and large-scale reinforcement learning. A new algorithm called FiberPO decomposes trust-region maintenance into global and local parts to stabilize policy updates at multiple scales, while the architecture balances thinking and non-thinking modes and adds dense multi-token prediction plus quantization-aware training. A sympathetic reader would care because the design promises strong performance at far lower active compute cost per token than dense models of similar total size.

Core claim

JoyAI-LLM Flash comprises 48B total parameters while activating only 2.7B parameters per forward pass, achieving a substantially higher sparsity ratio than contemporary industry-leading models of comparable scale. The model is pretrained on a 20-trillion-token corpus and further optimized through a post-training pipeline that includes FiberPO, a novel RL algorithm inspired by fibration theory that decomposes trust-region maintenance into global and local components for unified multi-scale stability control. The architecture also balances thinking and non-thinking cognitive modes and incorporates joint training-inference co-design with dense multi-token prediction and quantization-aware training.
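The headline sparsity claim reduces to a single ratio, worth making explicit (a quick arithmetic check of the abstract's own numbers; per-token FLOPs also depend on attention and embedding layers, which this ignores):

    # Active fraction implied by 48B total / 2.7B active parameters.
    total_params, active_params = 48e9, 2.7e9
    print(f"{active_params / total_params:.1%}")  # -> 5.6% of weights active per token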

What carries the argument

FiberPO, a reinforcement-learning algorithm that splits trust-region maintenance into separate global and local components to enforce multi-scale stability during LLM policy optimization.
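The paper's exact FiberPO objective is not reproduced on this page, so the following is only an illustrative sketch of the general two-level shape such an algorithm could take: PPO-style clipping of per-token importance ratios as the local trust region, plus a clipped, length-normalized sequence-level ratio applied as a stop-gradient weight as the global trust region. All names and the eps_tok/eps_seq values are placeholders, not the paper's formulation.

    import torch

    def global_local_surrogate(logp_new, logp_old, adv, mask,
                               eps_tok=0.2, eps_seq=0.1):
        # logp_new, logp_old: (B, T) per-token log-probs under the new/old policy.
        # adv: (B,) sequence-level advantages; mask: (B, T) response-token mask.
        log_r = (logp_new - logp_old) * mask

        # Local trust region: clip each token's importance ratio (PPO-style).
        r_tok = log_r.exp()
        a = adv[:, None]
        tok_term = torch.minimum(r_tok * a,
                                 r_tok.clamp(1 - eps_tok, 1 + eps_tok) * a)

        # Global trust region: clip the length-normalized sequence log-ratio and
        # apply it as a stop-gradient weight, down-weighting whole sequences
        # that have drifted too far from the old policy.
        log_s = log_r.sum(-1) / mask.sum(-1).clamp(min=1)
        w_seq = log_s.clamp(-eps_seq, eps_seq).exp().detach()

        return -(w_seq[:, None] * tok_term * mask).sum() / mask.sum()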

If this is right

  • The sparsity ratio allows the model to match or exceed performance of denser contemporaries while using far less active computation per token.
  • Joint training-inference co-design with multi-token prediction and quantization-aware training raises inference throughput.
  • Release of both the base 48B-A3B checkpoint and its post-trained variants enables direct community inspection and further adaptation.
  • Strategic balancing of thinking and non-thinking modes reduces token waste on routine steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the sparsity pattern generalizes, similar MoE designs could push total parameter counts well beyond 100B while keeping active compute fixed.
  • FiberPO's global-local split may transfer to other policy-gradient settings where stability must hold across different update horizons.
  • Lower active-parameter counts at inference time would cut energy and latency costs in production deployments of comparable capability.
  • The open checkpoints make it possible to test whether the efficiency gains survive fine-tuning on new domains.

Load-bearing premise

The combination of FiberPO, SFT, DPO, and large-scale RL produces the claimed token efficiency and performance without hidden degradations or unstated trade-offs.

What would settle it

A side-by-side benchmark run showing that JoyAI-LLM Flash consumes more tokens or active compute than claimed to reach parity with a dense 48B baseline on standard reasoning suites would disprove the efficiency advantage.

Figures

Figures reproduced from arXiv: 2604.03044 (Aichen Cai et al.).

Figure 1
Figure 1: Model performance vs. token consumption across different middle-scale LLMs. Accuracy and token consumption are averaged across the eighteen benchmarks used in post-training evaluation.
Figure 2
Figure 2: Agentic trajectory synthesis pipeline.
Figure 4
Figure 4: Verifiable Environment Pipeline.
Figure 5
Figure 5: (a) Aggregate gate g_agg (Eq. 4) with three regimes: pass-through (|x| ≤ C, slope 1), rollback (C < |x| < C* := (1 + 1/T_τ)C, slope −T_τ), and zeroed (|x| ≥ C*, output 0). As T_τ increases, the rollback zone narrows (width C/T_τ) and g_agg approaches a hard clip at ±C. (b) Base weight log w_τ^base (Eq. 2) in (log s⁺, log s⁻)-space with asymmetric thresholds. Dashed lines mark the budget boundaries C… (The gate's piecewise form is transcribed in the sketch after the figure list.)
Figure 6
Figure 6: Single-domain RLVR on DAPO-Math-17k [72]: (a) training reward and (b) validation accuracy (AIME 2024 mean@1) vs. training step. GRPO collapses after step 60, GSPO stagnates, and FiberPO improves steadily on both metrics.
Figure 7
Figure 7: Training diagnostics for the single-domain DAPO math run. Top row (comparative, all three methods): (a) …
Figure 8
Figure 8: Comparison of model performance (left bars) and token consumption (right bars) across six benchmarks.
Figure 9
Figure 9: Comparison of accuracy and throughput for quantized models: JoyAI-LLM Flash vs. Qwen3-30B-A3B.
Figure 10
Figure 10: Joint Optimization of MTP and Quantization (ISL/OSL = 1K/2K, Concurrency = 64).
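The Figure 5 caption above specifies the aggregate gate completely, so its piecewise form can be transcribed directly. A minimal sketch, with placeholder values for C and T_τ since the paper's Eq. 4 constants are not given here:

    import torch

    def aggregate_gate(x, C=1.0, T_tau=4.0):
        # Pass-through: g(x) = x for |x| <= C (slope 1).
        # Rollback: slope -T_tau over C < |x| < C* = (1 + 1/T_tau) * C, falling
        #           linearly from +/-C to 0 across a band of width C / T_tau.
        # Zeroed:   g(x) = 0 for |x| >= C*.
        C_star = (1.0 + 1.0 / T_tau) * C
        mag = x.abs()
        rollback = torch.sign(x) * (C - T_tau * (mag - C))
        return torch.where(mag <= C, x,
                           torch.where(mag < C_star, rollback, torch.zeros_like(x)))

As T_tau grows, the rollback band narrows and the gate approaches the hard clip at ±C described in the caption.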
Original abstract

We introduce JoyAI-LLM Flash, an efficient Mixture-of-Experts (MoE) language model designed to redefine the trade-off between strong performance and token efficiency in the sub-50B parameter regime. JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale reinforcement learning (RL) across diverse environments. To improve token efficiency, JoyAI-LLM Flash strategically balances thinking and non-thinking cognitive modes and introduces FiberPO, a novel RL algorithm inspired by fibration theory that decomposes trust-region maintenance into global and local components, providing unified multi-scale stability control for LLM policy optimization. To enhance architectural sparsity, the model comprises 48B total parameters while activating only 2.7B parameters per forward pass, achieving a substantially higher sparsity ratio than contemporary industry-leading models of comparable scale. To further improve inference throughput, we adopt a joint training-inference co-design that incorporates dense Multi-Token Prediction (MTP) and Quantization-Aware Training (QAT). We release the checkpoints for both JoyAI-LLM-48B-A3B Base and its post-trained variants on Hugging Face to support the open-source community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces JoyAI-LLM Flash, a 48B-parameter Mixture-of-Experts model that activates only 2.7B parameters per forward pass. It is pretrained on 20 trillion tokens and post-trained via SFT, DPO, large-scale RL, and a novel FiberPO algorithm (inspired by fibration theory for multi-scale trust-region control). Additional co-design elements include dense Multi-Token Prediction and Quantization-Aware Training. The authors claim substantially higher sparsity than peer models of comparable scale and release base and post-trained checkpoints on Hugging Face.

Significance. If the sparsity ratio and efficiency gains are empirically validated, the work could advance practical mid-scale LLM deployment by showing how architectural sparsity combined with specialized RL can improve token efficiency without proportional performance loss. The open release of checkpoints would aid reproducibility.

major comments (2)
  1. [Abstract] The headline claim of activating only 2.7B parameters out of 48B total and achieving a 'substantially higher sparsity ratio' than contemporary models (e.g., Mixtral-8x7B or DeepSeek-V2) is stated without any table, baseline active-parameter counts, or quantitative comparison, leaving the central efficiency assertion unsupported.
  2. [Abstract] The post-training pipeline (FiberPO + SFT + DPO + large-scale RL) is described as delivering improved token efficiency and performance, yet no benchmark scores, ablation results, or delta tables versus baselines appear in the provided text, so the performance and stability claims cannot be evaluated.
minor comments (1)
  1. [Abstract] The phrase 'joint training-inference co-design' is used without specifying which components are jointly optimized or how the dense MTP interacts with the MoE routing, which reduces clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate quantitative support for the claims.

Point-by-point responses
  1. Referee: [Abstract] The headline claim of activating only 2.7B parameters out of 48B total and achieving a 'substantially higher sparsity ratio' than contemporary models (e.g., Mixtral-8x7B or DeepSeek-V2) is stated without any table, baseline active-parameter counts, or quantitative comparison, leaving the central efficiency assertion unsupported.

    Authors: We agree that the abstract would be strengthened by explicit quantitative comparisons. In the revised manuscript we will insert a compact table (or inline values) listing total and active parameter counts for the cited baselines: Mixtral-8x7B (46.7 B total, ~12.9 B active) and DeepSeek-V2 (236 B total, 21 B active). This will directly substantiate the 48 B / 2.7 B sparsity ratio claim; the implied ratios are tabulated in the sketch after these responses. revision: yes

  2. Referee: [Abstract] The post-training pipeline (FiberPO + SFT + DPO + large-scale RL) is described as delivering improved token efficiency and performance, yet no benchmark scores, ablation results, or delta tables versus baselines appear in the provided text, so the performance and stability claims cannot be evaluated.

    Authors: We acknowledge that the abstract itself contains no numerical results. The full manuscript reports benchmark scores and ablations in Section 4; however, to make the abstract self-contained we will add a concise summary table of key metrics (MMLU, GSM8K, etc.) together with deltas versus the same baselines, plus a one-sentence reference to the FiberPO ablation findings on stability. revision: yes
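Taking the parameter counts quoted in the first response at face value (they are the rebuttal's figures, not independently verified here), the implied active-parameter fractions can be tabulated in a few lines of Python:

    # Active-parameter fractions implied by the counts quoted in the rebuttal.
    models = {
        "JoyAI-LLM Flash": (48e9, 2.7e9),
        "Mixtral-8x7B": (46.7e9, 12.9e9),
        "DeepSeek-V2": (236e9, 21e9),
    }
    for name, (total, active) in models.items():
        print(f"{name:>16}: {active / total:.1%} active")
    # -> 5.6%, 27.6%, 8.9%: roughly a 5x lower active fraction than Mixtral-8x7B.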

Circularity Check

0 steps flagged

No significant circularity detected; the claims are descriptive design statements without equation-level reductions.

full rationale

The manuscript presents JoyAI-LLM Flash as a 48B-parameter MoE model activating 2.7B parameters, pretrained on 20T tokens and post-trained via SFT/DPO/RL plus the FiberPO algorithm. No equations, derivations, or fitted-parameter predictions appear in the abstract or described sections. The sparsity ratio is asserted as an architectural outcome rather than derived from self-citations, uniqueness theorems, or renamed empirical patterns. FiberPO is introduced as a novel RL method inspired by fibration theory without reducing to prior self-cited results. The central claims therefore remain self-contained descriptions of model scale and training pipeline rather than circular reductions to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all technical details remain at the level of high-level descriptions.

pith-pipeline@v0.9.0 · 5799 in / 1004 out tokens · 35906 ms · 2026-05-13T19:23:48.119762+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling

    cs.LG 2026-04 unverdicted novelty 6.0

    AdaLeZO uses a non-stationary multi-armed bandit to adaptively allocate perturbation budget across layers in zeroth-order optimization and applies inverse probability weighting to reduce variance while preserving unbi...

  2. Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

    cs.DC 2026-05 unverdicted novelty 5.0

    Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.

Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · cited by 2 Pith papers · 27 internal anchors

  1. [1]

    Ockbench: Measuring the efficiency of llm reasoning. arXiv:2511.05722, 2026

    Zheng Du, Hao Kang, Song Han, Tushar Krishna, and Ligeng Zhu. Ockbench: Measuring the efficiency of llm reasoning. arXiv:2511.05722, 2026

  2. [2]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv:2408.03314, 2024

  3. [3]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024

  4. [4]

    Glm-4.5: Agentic, reasoning, and coding (arc) foundation models, 2025

    GLM Team. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models, 2025

  5. [5]

    Qwen3-30b-a3b-instruct-2507, July 2026

    Qwen Team. Qwen3-30b-a3b-instruct-2507, July 2026

  6. [6]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026

  7. [7]

    Step 3.5 flash: Open frontier-level intelligence with 11b active parameters, 2026

    Ailin Huang, Ang Li, et al. Step 3.5 flash: Open frontier-level intelligence with 11b active parameters, 2026

  8. [8]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv:2412.19437, 2024

  9. [9]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025

  10. [10]

    Root mean square layer normalization. Advances in neural information processing systems, 32, 2019

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019

  11. [11]

    Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  12. [12]

    Language modeling with gated convolutional networks

    Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In International conference on machine learning, pages 933–941. PMLR, 2017

  13. [13]

    Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv preprint arXiv:2408.15664, 2024

    Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv preprint arXiv:2408.15664, 2024

  14. [14]

    Muon: An optimizer for hidden layers in neural networks, 2024

    Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cecista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024

  15. [15]

    MiMo-V2-Flash Technical Report

    Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780, 2026

  16. [16]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  17. [17]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

  18. [18]

    Pipedream: Fast and efficient pipeline parallel dnn training

    Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. Pipedream: Fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377, 2018

  19. [19]

    Breadth-first pipeline parallelism. Proceedings of Machine Learning and Systems, 5:48–67, 2023

    Joel Lamy-Poirier. Breadth-first pipeline parallelism. Proceedings of Machine Learning and Systems, 5:48–67, 2023

  20. [20]

    Hanayo: Harnessing wave-like pipeline parallelism for enhanced large model training efficiency

    Ziming Liu, Shenggan Cheng, Haotian Zhou, and Yang You. Hanayo: Harnessing wave-like pipeline parallelism for enhanced large model training efficiency. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–13, 2023

  21. [21]

    Efficient large-scale language model training on gpu clusters using megatron-lm

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the international conference for high performance computing, networking, st...

  22. [22]

    Zero bubble pipeline parallelism. arXiv preprint arXiv:2401.10241, 2023

    Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. Zero bubble pipeline parallelism. arXiv preprint arXiv:2401.10241, 2023

  23. [23]

    Moe a2a interleaved 1f1b based computation and communication overlap

    NVIDIA. Moe a2a interleaved 1f1b based computation and communication overlap. https://developer.nvidia.com/zh-cn/blog/1f1b-moe-a2a-computing-overlap/, 2025

  24. [24]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020

  25. [25]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

  26. [26]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 68658–68685. Curran Associates...

  27. [27]

    Datatrove: large scale data processing, 2024

    Guilherme Penedo, Hynek Kydlíček, Alessandro Cappelli, Mario Sasko, and Thomas Wolf. Datatrove: large scale data processing, 2024

  28. [28]

    Approximate nearest neighbors: towards removing the curse of dimensionality

    Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pages 604–613, New York, NY, USA, 1998. Association for Computing Machinery

  29. [29]

    On the resemblance and containment of documents

    A.Z. Broder. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), pages 21–29, 1997

  30. [30]

    Starcoder 2 and the stack v2: The next generation, 2024

    Anton Lozhkov, Raymond Li, and others. Starcoder 2 and the stack v2: The next generation, 2024

  31. [31]

    Qwen2.5 technical report, 2025

    Qwen Team. Qwen2.5 technical report, 2025

  32. [32]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024

  33. [33]

    Rewriting pre-training data boosts llm performance in math and code, 2025

    Kazuki Fujii, Yukito Tajima, Sakae Mizuki, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Masanari Ohi, Masaki Kawamura, Taishi Nakamura, Takumi Okamoto, Shigeki Ishida, Kakeru Hattori, Youmi Ma, Hiroya Takamura, Rio Yokota, and Naoaki Okazaki. Rewriting pre-training data boosts llm performance in math and code, 2025

  34. [34]

    Olmo 3, 2025

    Olmo Team. Olmo 3, 2025

  35. [35]

    Nemotron 3 nano: Open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning, 2025

    NVIDIA Team. Nemotron 3 nano: Open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning, 2025

  36. [36]

    Deepseek-v3.2: Pushing the frontier of open large language models, 2025

    DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models, 2025

  37. [37]

    Mineru2.5: A decoupled vision-language model for efficient high-resolution document parsing, 2025

    Junbo Niu, Zheng Liu, et al. Mineru2.5: A decoupled vision-language model for efficient high-resolution document parsing, 2025

  38. [38]

    DeepSeek-OCR: Contexts Optical Compression

    Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025

  39. [39]

    Reformulation for pretraining data augmentation, 2025

    Xintong Hao, Ruijie Zhu, Ge Zhang, Ke Shen, and Chenggang Li. Reformulation for pretraining data augmentation, 2025

  40. [40]

    Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset. ArXiv, abs/2412.02595, 2024

    Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset. ArXiv, abs/2412.02595, 2024

  41. [41]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024

  42. [42]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Jack W. Rae, Sebastian Borgeaud, et al. Scaling language models: Methods, analysis & insights from training gopher. CoRR, abs/2112.11446, 2021

  43. [43]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, et al. Training compute-optimal large language models. CoRR, abs/2203.15556, 2022

  44. [44]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, et al. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020

  45. [45]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  46. [46]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024

  47. [47]

    Cmmlu: Measuring massive multitask language understanding in chinese

    Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11260–11285, 2024

  48. [48]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  49. [49]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  50. [50]

    Evaluating Large Language Models Trained on Code

    Mark Chen et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  51. [51]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

  52. [52]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

  53. [53]

    gpt-oss-120b & gpt-oss-20b model card, 2025

    OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025

  54. [54]

    Omniforce: On human-centered, large model empowered and cloud-edge collaborative automl system. nature npj-ai, 2023

    Chao Xue, Wei Liu, et al. Omniforce: On human-centered, large model empowered and cloud-edge collaborative automl system. nature npj-ai, 2023

  55. [55]

    Swe-smith: Scaling data for software engineering agents. arXiv preprint arXiv:2504.21798, 2025

    John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. Swe-smith: Scaling data for software engineering agents. In Proceedings of the 39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025 D&B Spotlight), 2025. arXiv:2504.21798, accepted at ...

  56. [56]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents, 2024

    Xingyao Wang, Boxuan Li, et al. OpenHands: An Open Platform for AI Software Developers as Generalist Agents, 2024

  57. [57]

    SWE-agent: Agent-computer interfaces enable automated software engineering

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  58. [58]

    Openr1-math-220k dataset, 2025

    Open-R1 Team. Openr1-math-220k dataset, 2025. Accessed: 2025-03-06

  59. [59]

    Nemotron-math: Efficient long-context distillation of mathematical reasoning from multi-mode supervision. arXiv preprint arXiv:2512.15489, 2025

    Wei Du, Shubham Toshniwal, Branislav Kisacanin, Sadegh Mahdavi, Ivan Moshkov, George Armstrong, Stephen Ge, Edgar Minasyan, Feng Chen, and Igor Gitman. Nemotron-math: Efficient long-context distillation of mathematical reasoning from multi-mode supervision. arXiv preprint arXiv:2512.15489, 2025

  60. [60]

    Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020

  61. [61]

    Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

  62. [62]

    Measuring and narrowing the compositionality gap in language models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, 2023

  63. [63]

    Measuring short-form factuality in large language models

    Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368, 2024

  64. [64]

    Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation

    Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, and Manaal Faruqui. Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies...

  65. [65]

    Scholarsearch: Benchmarking scholar searching ability of llms. arXiv preprint arXiv:2506.13784, 2025

    Junting Zhou, Wang Li, Yiyan Liao, Nengyuan Zhang, Tingjia Miao, Zhihui Qi, Yuhan Wu, and Tong Yang. Scholarsearch: Benchmarking scholar searching ability of llms. arXiv preprint arXiv:2506.13784, 2025

  66. [66]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023

  67. [67]

    Taskcraft: Automated generation of agentic tasks. arXiv preprint arXiv:2506.10055, 2025

    Dingfeng Shi, Jingyi Cao, Qianben Chen, Weichen Sun, Weizhen Li, Hongxuan Lu, Fangchen Dong, Tianrui Qin, King Zhu, Minghao Liu, et al. Taskcraft: Automated generation of agentic tasks. arXiv preprint arXiv:2506.10055, 2025

  68. [68]

    Nemotron-Post-Training-Dataset-v1, 2025

    Dhruv Nathawani, Igor Gitman, Somshubra Majumdar, Evelina Bakhturina, Ameya Sunil Mahabaleshwarkar, Jian Zhang, and Jane Polak Scowcroft. Nemotron-Post-Training-Dataset-v1, 2025

  69. [69]

    Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  70. [70]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  71. [71]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  72. [72]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  73. [73]

    Fibration policy optimization. arXiv preprint arXiv:2603.08239, 2026

    Chang Li, Tshihao Tsu, Yaren Zhang, Chao Xue, and Xiaodong He. Fibration policy optimization. arXiv preprint arXiv:2603.08239, 2026

  74. [74]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015

  75. [75]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

  76. [76]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  77. [77]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019

  78. [78]

    C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

    Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36:62991–63010, 2023

  79. [79]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024

  80. [80]

    Supergpqa: Scaling llm evaluation across 285 graduate disciplines, 2025

    M-A-P Team. Supergpqa: Scaling llm evaluation across 285 graduate disciplines, 2025

Showing first 80 references.