MiMo-V2-Flash Technical Report
Pith reviewed 2026-05-12 11:27 UTC · model grok-4.3
The pith
MiMo-V2-Flash matches top open-weight models like DeepSeek-V3.2 using half their total parameters via sparse MoE design.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MiMo-V2-Flash is a Mixture-of-Experts model with 309B total parameters and 15B active parameters that rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2 despite using only half or one-third of their total parameters. It adopts a hybrid attention architecture interleaving Sliding Window Attention with global attention at a 5:1 ratio with a 128-token window, pre-trains with Multi-Token Prediction on 27 trillion tokens, and introduces Multi-Teacher On-Policy Distillation, in which domain-specialized teachers provide dense token-level rewards. The model extends to 256k context and repurposes its MTP layers for speculative decoding, reaching up to 3.6 acceptance length and a 2.6x decoding speedup.
What carries the argument
Mixture-of-Experts architecture with 15B active parameters out of 309B total, supported by Multi-Teacher On-Policy Distillation that transfers expertise from specialized teachers via token-level rewards.
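The sparse-activation arithmetic behind that claim can be sketched with a toy top-k router. The expert count and k below are illustrative assumptions; the report excerpt states neither for MiMo-V2-Flash.

```python
import numpy as np

def topk_route(hidden, router_w, k=2):
    """Route one token to its top-k experts (k and n_experts are
    hypothetical; the report does not state MiMo-V2-Flash's values)."""
    logits = hidden @ router_w          # [n_experts] router scores
    topk = np.argsort(logits)[-k:]      # indices of the k largest logits
    weights = np.exp(logits[topk])
    weights /= weights.sum()            # softmax over the selected experts only
    return topk, weights

rng = np.random.default_rng(0)
d, n_experts = 16, 64
hidden = rng.standard_normal(d)
router_w = rng.standard_normal((d, n_experts))
experts, weights = topk_route(hidden, router_w, k=2)
# Only k of n_experts run per token, so expert parameters touched per token
# are roughly (k / n_experts) of the expert total -- the mechanism by which
# 309B total parameters can yield only 15B active ones.
```

The total/active ratio is what makes the "half the parameters" comparison meaningful: deployment memory scales with total parameters, but per-token compute scales with active ones.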
If this is right
- The model reaches comparable reasoning and agentic performance to systems with two or three times more total parameters.
- Inference runs up to 2.6 times faster, with an average acceptance length of 3.6 tokens, by treating the MTP layers as a speculative draft model.
- Context length extends to 256k after initial 32k training without separate long-context pre-training.
- Open release of the model weights and three-layer MTP weights supports community use and further development.
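The speedup bullet admits a rough first-order sanity check. The draft-cost ratio below is an assumption (the report gives no per-pass cost), so this is an accounting sketch, not a measurement, and it ignores verification batching overheads.

```python
def speculative_speedup(acceptance_len, draft_cost_ratio, n_draft):
    """First-order per-cycle speedup of speculative decoding.
    Baseline: one target-model forward per token. Speculative: one target
    verification pass plus n_draft cheap draft passes yields
    `acceptance_len` tokens on average. draft_cost_ratio is the assumed
    cost of one draft pass relative to one target pass."""
    cost_per_cycle = 1.0 + n_draft * draft_cost_ratio
    return acceptance_len / cost_per_cycle

# With three MTP draft layers and the reported 3.6 acceptance length, a
# draft pass costing ~12% of a target pass would land near the reported
# 2.6x (3.6 / 1.36 ~= 2.65); cheaper drafts push the bound higher.
speedup = speculative_speedup(3.6, 0.12, 3)
```

This is why the referee's request for acceptance-length distributions matters: the mean alone fixes the numerator but says nothing about variance across domains.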
Where Pith is reading between the lines
- Sparse activation paired with targeted distillation may let future models achieve high capability at lower memory and compute cost during deployment.
- Hybrid sliding-window and global attention offers a practical balance for long-context tasks that avoids full quadratic scaling.
- Reusing pre-training prediction heads for inference acceleration could generalize to other auxiliary objectives in language models.
Load-bearing premise
That the benchmark results and training details, none of which are reproduced in the abstract, actually demonstrate performance rivaling DeepSeek-V3.2 and Kimi-K2 under comparable conditions.
What would settle it
Independent runs on the same public benchmarks where MiMo-V2-Flash scores noticeably below DeepSeek-V3.2 or Kimi-K2.
Original abstract
We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length and subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. In this framework, domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense and token-level reward, enabling the student model to perfectly master teacher expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MiMo-V2-Flash, a Mixture-of-Experts model with 309B total and 15B active parameters that uses hybrid sliding-window attention (128-token window at 5:1 ratio) interleaved with global attention. It is pre-trained on 27 trillion tokens with multi-token prediction (MTP), extended from 32k to 256k context, and post-trained via a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. The paper claims this model rivals DeepSeek-V3.2 and Kimi-K2 while using only half and one-third their parameters, respectively, and achieves up to 3.6 acceptance length and 2.6x decoding speedup by repurposing MTP layers for speculative decoding. The model and MTP weights are open-sourced.
Significance. If the performance claims hold under matched evaluation conditions, the work would demonstrate practical advances in parameter-efficient MoE scaling for reasoning and agentic capabilities, with the hybrid attention and MOPD methods offering reusable design insights. The open-sourcing of weights and MTP layers would provide immediate value for community replication and further research on speculative decoding.
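The hybrid attention described in the summary can be sketched in a few lines. The exact interleaving is one reading of the 5:1 ratio (the ambiguity is raised in the minor comments below), so the layer pattern here is an assumption; the sliding-window mask itself follows directly from the 128-token window.

```python
import numpy as np

def layer_pattern(n_layers, swa_per_global=5):
    """One reading of the 5:1 hybrid ratio: five SWA layers followed by
    one global layer, repeating. The report excerpt does not pin down
    the interleaving, so this pattern is an assumption."""
    return ["global" if (i + 1) % (swa_per_global + 1) == 0 else "swa"
            for i in range(n_layers)]

def swa_mask(seq_len, window=128):
    """Causal sliding-window mask: position i attends to [i-window+1, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

pattern = layer_pattern(12)     # five 'swa' then one 'global', twice
mask = swa_mask(256, window=128)
# Row 200 attends to exactly 128 positions (73..200), not all 201 causal
# ones -- SWA layers cost O(seq_len * window) instead of O(seq_len^2),
# while the sparse global layers preserve long-range routing.
```

Under this reading, only one layer in six pays the quadratic cost, which is the "practical balance" the Pith summary points to.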
major comments (2)
- [Abstract] The claim that MiMo-V2-Flash 'rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively' is unsupported by any benchmark scores, tables, or evaluation details. No side-by-side results on MMLU, GSM8K, HumanEval, or similar tasks are supplied, nor is there information on prompting, shot count, or whether baselines were re-evaluated under identical conditions. This is load-bearing for the central contribution.
- [Abstract] The inference claim of 'up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers' is presented without experimental setup, hardware details, baseline comparisons, or acceptance-length distributions. This prevents assessment of whether the speedup is reproducible or generalizes beyond the reported conditions.
minor comments (2)
- [Abstract] The hybrid attention ratio is described as '5:1' without clarifying whether this denotes the fraction of SWA layers, the interleaving pattern, or another quantity; a diagram or explicit definition in the main text would improve clarity.
- [Abstract] The context-length extension from native 32k to 256k is mentioned without describing the method (e.g., RoPE scaling factors, continued pre-training schedule, or long-context benchmark results).
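On the last point: the report excerpt does not say how the native 32k context was stretched to 256k. A common mechanism is position interpolation on RoPE frequencies; the sketch below shows that generic mechanism as an assumption, not MiMo-V2-Flash's actual recipe.

```python
import numpy as np

def rope_freqs(dim, base=10000.0, scale=1.0):
    """Per-pair rotary frequencies. Position-interpolation-style context
    extension divides frequencies (equivalently, positions) by `scale`
    so that positions up to scale * trained_len map into the trained
    range. This is the generic mechanism only; the report does not state
    which extension method MiMo-V2-Flash uses."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return inv_freq / scale   # scale=8.0 folds 256k positions onto a 32k grid

f_native = rope_freqs(64)
f_scaled = rope_freqs(64, scale=8.0)
# With scale 8, the rotation angles at position 256_000 equal the native
# angles at position 32_000, so the model never sees out-of-range angles.
angles_extended = 256_000 * f_scaled
angles_trained = 32_000 * f_native
```

Whatever the actual method, the referee's request stands: scale factors and any continued-training schedule are exactly the details a reader needs to reproduce the 256k extension.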
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of results and experimental details.
Point-by-point responses
- Referee: [Abstract] The claim that MiMo-V2-Flash 'rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively' is unsupported by any benchmark scores, tables, or evaluation details. No side-by-side results on MMLU, GSM8K, HumanEval, or similar tasks are supplied, nor is there information on prompting, shot count, or whether baselines were re-evaluated under identical conditions. This is load-bearing for the central contribution.
Authors: We agree that the abstract claim would benefit from direct supporting evidence. The full manuscript contains benchmark tables and evaluation details in the Experiments section, but to make the abstract self-contained we will revise it to include key side-by-side scores on MMLU, GSM8K, HumanEval, and related tasks, along with notes on prompting, shot counts, and confirmation that baselines were run under matched conditions. We will also add explicit references to the relevant tables. Revision: yes.
- Referee: [Abstract] The inference claim of 'up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers' is presented without experimental setup, hardware details, baseline comparisons, or acceptance-length distributions. This prevents assessment of whether the speedup is reproducible or generalizes beyond the reported conditions.
Authors: We acknowledge the abstract is overly concise on the inference results. The manuscript includes a dedicated section on speculative decoding that describes the MTP-layer repurposing, hardware setup, the baseline autoregressive decoder, and acceptance-length statistics. We will revise the abstract to briefly summarize the experimental conditions, hardware, baseline, and key statistics (including distributions), and ensure the full section provides all reproducibility details. Revision: yes.
Circularity Check
No circularity: empirical technical report with no derivation chain
Full rationale
The document is a model release report describing architecture choices (hybrid SWA/global attention, MTP pre-training, MOPD post-training), parameter counts, and benchmark rivalry claims. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Central performance assertions rest on external benchmark comparisons rather than internal definitions that loop back to the same quantities. The paper is therefore self-contained against external benchmarks with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- hybrid attention ratio
- sliding window size
axioms (1)
- Domain assumption: standard transformer attention and MoE routing assumptions hold at this scale.
Forward citations
Cited by 41 Pith papers
-
Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
-
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
-
Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
Persistent 'Rock Tokens' in on-policy distillation resist teacher corrections, consume large gradient norms, yet add negligible value to reasoning, allowing targeted bypassing to streamline alignment.
-
Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows
EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-speciali...
-
Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning
OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...
-
GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection
GameGen-Verifier decomposes game specifications into keypoints, injects runtime states for targeted checks, and achieves 92.2% accuracy on 100 games while running up to 16.6x faster than agent-based baselines.
-
Rubric-based On-policy Distillation
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...
-
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
-
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
-
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions
DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.
-
Self-Distilled RLVR
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
-
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
-
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...
-
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Sparse RL on capable teachers followed by dense distillation to students beats direct GRPO on students for verifiable math reasoning.
-
Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and p...
-
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
-
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
-
UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.
-
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
-
Multilingual Safety Alignment via Self-Distillation
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
-
Co-Evolving Policy Distillation
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
-
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 6...
-
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
-
How Transformers Learn to Plan via Multi-Token Prediction
Multi-token prediction induces a two-stage reverse reasoning process in Transformers via gradient decoupling, improving planning on synthetic and realistic tasks.
-
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
-
From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
A two-stage SFT pipeline distills execution-free then execution-based trajectories from a 480B model into smaller Qwen2.5-Coder agents, yielding 62.2% resolution on SWE-bench Verified and 44.1% zero-shot on the multil...
-
On-Policy Distillation with Best-of-N Teacher Rollout Selection
BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
-
OptiMat Alloys: a FAIR, living database of multi-principal element alloys enabled by a conversational agent
OptiMat Alloys is a conversational AI system that maintains a living FAIR database of multi-principal element alloy calculations and enables natural-language, on-demand computations with built-in uncertainty checks.
-
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
-
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
-
A Survey of On-Policy Distillation for Large Language Models
On-policy distillation reframes LLM knowledge transfer as iterative correction on student trajectories rather than single-pass imitation, with the survey organizing the field along divergence design, feedback sources,...