MiMo-V2-Flash Technical Report
Pith reviewed 2026-05-12 11:27 UTC · model grok-4.3
The pith
MiMo-V2-Flash matches top open-weight models like DeepSeek-V3.2 using half their total parameters via sparse MoE design.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MiMo-V2-Flash is a Mixture-of-Experts model with 309B total parameters and 15B active parameters that rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2 despite using only half or one-third of their total parameters. It adopts a hybrid attention architecture interleaving Sliding Window Attention with global attention at a 5:1 ratio with a 128-token window, pre-trains with Multi-Token Prediction on 27 trillion tokens, and introduces Multi-Teacher On-Policy Distillation, in which domain-specialized teachers provide dense token-level rewards. The model extends to 256k context and repurposes its MTP layers for speculative decoding, reaching up to 3.6 acceptance length and a 2.6x decoding speedup.
What carries the argument
Mixture-of-Experts architecture with 15B active parameters out of 309B total, supported by Multi-Teacher On-Policy Distillation that transfers expertise from specialized teachers via token-level rewards.
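The sparse-activation arithmetic behind that claim can be sketched with a toy top-k router. The expert count and k below are illustrative assumptions; the report excerpt states neither for MiMo-V2-Flash.

```python
import numpy as np

def topk_route(hidden, router_w, k=2):
    """Route one token to its top-k experts (k and n_experts are
    hypothetical; the report does not state MiMo-V2-Flash's values)."""
    logits = hidden @ router_w          # [n_experts] router scores
    topk = np.argsort(logits)[-k:]      # indices of the k largest logits
    weights = np.exp(logits[topk])
    weights /= weights.sum()            # softmax over the selected experts only
    return topk, weights

rng = np.random.default_rng(0)
d, n_experts = 16, 64
hidden = rng.standard_normal(d)
router_w = rng.standard_normal((d, n_experts))
experts, weights = topk_route(hidden, router_w, k=2)
# Only k of n_experts run per token, so expert parameters touched per token
# are roughly (k / n_experts) of the expert total -- the mechanism by which
# 309B total parameters can yield only 15B active ones.
```

The total/active ratio is what makes the "half the parameters" comparison meaningful: deployment memory scales with total parameters, but per-token compute scales with active ones.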
If this is right
- The model reaches comparable reasoning and agentic performance to systems with two or three times more total parameters.
- Inference runs up to 2.6 times faster, with an average acceptance length of 3.6 tokens, by treating the MTP layers as a speculative draft model.
- Context length extends to 256k after initial 32k training without separate long-context pre-training.
- Open release of the model weights and three-layer MTP weights supports community use and further development.
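The speedup bullet admits a rough first-order sanity check. The draft-cost ratio below is an assumption (the report gives no per-pass cost), so this is an accounting sketch, not a measurement, and it ignores verification batching overheads.

```python
def speculative_speedup(acceptance_len, draft_cost_ratio, n_draft):
    """First-order per-cycle speedup of speculative decoding.
    Baseline: one target-model forward per token. Speculative: one target
    verification pass plus n_draft cheap draft passes yields
    `acceptance_len` tokens on average. draft_cost_ratio is the assumed
    cost of one draft pass relative to one target pass."""
    cost_per_cycle = 1.0 + n_draft * draft_cost_ratio
    return acceptance_len / cost_per_cycle

# With three MTP draft layers and the reported 3.6 acceptance length, a
# draft pass costing ~12% of a target pass would land near the reported
# 2.6x (3.6 / 1.36 ~= 2.65); cheaper drafts push the bound higher.
speedup = speculative_speedup(3.6, 0.12, 3)
```

This is why the referee's request for acceptance-length distributions matters: the mean alone fixes the numerator but says nothing about variance across domains.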
Where Pith is reading between the lines
- Sparse activation paired with targeted distillation may let future models achieve high capability at lower memory and compute cost during deployment.
- Hybrid sliding-window and global attention offers a practical balance for long-context tasks that avoids full quadratic scaling.
- Reusing pre-training prediction heads for inference acceleration could generalize to other auxiliary objectives in language models.
Load-bearing premise
That the benchmark results and training details, none of which are reproduced in the abstract, actually demonstrate performance rivaling DeepSeek-V3.2 and Kimi-K2 under comparable conditions.
What would settle it
Independent runs on the same public benchmarks where MiMo-V2-Flash scores noticeably below DeepSeek-V3.2 or Kimi-K2.
Original abstract
We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length and subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. In this framework, domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense and token-level reward, enabling the student model to perfectly master teacher expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MiMo-V2-Flash, a Mixture-of-Experts model with 309B total and 15B active parameters that uses hybrid sliding-window attention (128-token window at 5:1 ratio) interleaved with global attention. It is pre-trained on 27 trillion tokens with multi-token prediction (MTP), extended from 32k to 256k context, and post-trained via a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. The paper claims this model rivals DeepSeek-V3.2 and Kimi-K2 while using only half and one-third their parameters, respectively, and achieves up to 3.6 acceptance length and 2.6x decoding speedup by repurposing MTP layers for speculative decoding. The model and MTP weights are open-sourced.
Significance. If the performance claims hold under matched evaluation conditions, the work would demonstrate practical advances in parameter-efficient MoE scaling for reasoning and agentic capabilities, with the hybrid attention and MOPD methods offering reusable design insights. The open-sourcing of weights and MTP layers would provide immediate value for community replication and further research on speculative decoding.
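The hybrid attention described in the summary can be sketched in a few lines. The exact interleaving is one reading of the 5:1 ratio (the ambiguity is raised in the minor comments below), so the layer pattern here is an assumption; the sliding-window mask itself follows directly from the 128-token window.

```python
import numpy as np

def layer_pattern(n_layers, swa_per_global=5):
    """One reading of the 5:1 hybrid ratio: five SWA layers followed by
    one global layer, repeating. The report excerpt does not pin down
    the interleaving, so this pattern is an assumption."""
    return ["global" if (i + 1) % (swa_per_global + 1) == 0 else "swa"
            for i in range(n_layers)]

def swa_mask(seq_len, window=128):
    """Causal sliding-window mask: position i attends to [i-window+1, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

pattern = layer_pattern(12)     # five 'swa' then one 'global', twice
mask = swa_mask(256, window=128)
# Row 200 attends to exactly 128 positions (73..200), not all 201 causal
# ones -- SWA layers cost O(seq_len * window) instead of O(seq_len^2),
# while the sparse global layers preserve long-range routing.
```

Under this reading, only one layer in six pays the quadratic cost, which is the "practical balance" the Pith summary points to.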
major comments (2)
- [Abstract] The claim that MiMo-V2-Flash 'rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively' is unsupported by any benchmark scores, tables, or evaluation details. No side-by-side results on MMLU, GSM8K, HumanEval, or similar tasks are supplied, nor is there information on prompting, shot count, or whether baselines were re-evaluated under identical conditions. This is load-bearing for the central contribution.
- [Abstract] The inference claim of 'up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers' is presented without experimental setup, hardware details, baseline comparisons, or acceptance-length distributions. This prevents assessment of whether the speedup is reproducible or generalizes beyond the reported conditions.
minor comments (2)
- [Abstract] The hybrid attention ratio is described as '5:1' without clarifying whether this denotes the fraction of SWA layers, the interleaving pattern, or another quantity; a diagram or explicit definition in the main text would improve clarity.
- [Abstract] The context-length extension from native 32k to 256k is mentioned without describing the method (e.g., RoPE scaling factors, continued pre-training schedule, or long-context benchmark results).
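On the last point: the report excerpt does not say how the native 32k context was stretched to 256k. A common mechanism is position interpolation on RoPE frequencies; the sketch below shows that generic mechanism as an assumption, not MiMo-V2-Flash's actual recipe.

```python
import numpy as np

def rope_freqs(dim, base=10000.0, scale=1.0):
    """Per-pair rotary frequencies. Position-interpolation-style context
    extension divides frequencies (equivalently, positions) by `scale`
    so that positions up to scale * trained_len map into the trained
    range. This is the generic mechanism only; the report does not state
    which extension method MiMo-V2-Flash uses."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return inv_freq / scale   # scale=8.0 folds 256k positions onto a 32k grid

f_native = rope_freqs(64)
f_scaled = rope_freqs(64, scale=8.0)
# With scale 8, the rotation angles at position 256_000 equal the native
# angles at position 32_000, so the model never sees out-of-range angles.
angles_extended = 256_000 * f_scaled
angles_trained = 32_000 * f_native
```

Whatever the actual method, the referee's request stands: scale factors and any continued-training schedule are exactly the details a reader needs to reproduce the 256k extension.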
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of results and experimental details.
Point-by-point responses
- Referee: [Abstract] The claim that MiMo-V2-Flash 'rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively' is unsupported by any benchmark scores, tables, or evaluation details. No side-by-side results on MMLU, GSM8K, HumanEval, or similar tasks are supplied, nor is there information on prompting, shot count, or whether baselines were re-evaluated under identical conditions. This is load-bearing for the central contribution.
Authors: We agree that the abstract claim would benefit from direct supporting evidence. The full manuscript contains benchmark tables and evaluation details in the Experiments section, but to make the abstract self-contained we will revise it to include key side-by-side scores on MMLU, GSM8K, HumanEval, and related tasks, along with notes on prompting, shot counts, and confirmation that baselines were run under matched conditions. We will also add explicit references to the relevant tables. Revision: yes.
- Referee: [Abstract] The inference claim of 'up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers' is presented without experimental setup, hardware details, baseline comparisons, or acceptance-length distributions. This prevents assessment of whether the speedup is reproducible or generalizes beyond the reported conditions.
Authors: We acknowledge the abstract is overly concise on the inference results. The manuscript includes a dedicated section on speculative decoding that describes the MTP-layer repurposing, hardware setup, the baseline autoregressive decoder, and acceptance-length statistics. We will revise the abstract to briefly summarize the experimental conditions, hardware, baseline, and key statistics (including distributions), and ensure the full section provides all reproducibility details. Revision: yes.
Circularity Check
No circularity: empirical technical report with no derivation chain
Full rationale
The document is a model release report describing architecture choices (hybrid SWA/global attention, MTP pre-training, MOPD post-training), parameter counts, and benchmark rivalry claims. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Central performance assertions rest on external benchmark comparisons rather than internal definitions that loop back to the same quantities. The paper is therefore self-contained against external benchmarks with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- hybrid attention ratio
- sliding window size
axioms (1)
- Domain assumption: standard transformer attention and MoE routing assumptions hold at this scale.
Forward citations
Cited by 41 Pith papers
-
Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
-
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
-
Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
Persistent 'Rock Tokens' in on-policy distillation resist teacher corrections, consume large gradient norms, yet add negligible value to reasoning, allowing targeted bypassing to streamline alignment.
-
Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows
EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-speciali...
-
Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning
OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...
-
GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection
GameGen-Verifier decomposes game specifications into keypoints, injects runtime states for targeted checks, and achieves 92.2% accuracy on 100 games while running up to 16.6x faster than agent-based baselines.
-
Rubric-based On-policy Distillation
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...
-
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
-
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
-
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions
DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.
-
Self-Distilled RLVR
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
-
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
-
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...
-
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Sparse RL on capable teachers followed by dense distillation to students beats direct GRPO on students for verifiable math reasoning.
-
Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and p...
-
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
-
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
-
UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.
-
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
-
Multilingual Safety Alignment via Self-Distillation
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
-
Co-Evolving Policy Distillation
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
-
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 6...
-
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
-
How Transformers Learn to Plan via Multi-Token Prediction
Multi-token prediction induces a two-stage reverse reasoning process in Transformers via gradient decoupling, improving planning on synthetic and realistic tasks.
-
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
-
From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
A two-stage SFT pipeline distills execution-free then execution-based trajectories from a 480B model into smaller Qwen2.5-Coder agents, yielding 62.2% resolution on SWE-bench Verified and 44.1% zero-shot on the multil...
-
On-Policy Distillation with Best-of-N Teacher Rollout Selection
BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
-
OptiMat Alloys: a FAIR, living database of multi-principal element alloys enabled by a conversational agent
OptiMat Alloys is a conversational AI system that maintains a living FAIR database of multi-principal element alloy calculations and enables natural-language, on-demand computations with built-in uncertainty checks.
-
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
-
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
-
A Survey of On-Policy Distillation for Large Language Models
On-policy distillation reframes LLM knowledge transfer as iterative correction on student trajectories rather than single-pass imitation, with the survey organizing the field along divergence design, feedback sources,...