Recognition: no theorem link
Kimi K2: Open Agentic Intelligence
Pith reviewed 2026-05-10 17:44 UTC · model grok-4.3
The pith
Kimi K2 is a 1-trillion-parameter open MoE model that leads non-thinking models on agentic and software engineering benchmarks through stable pre-training and environment-based post-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kimi K2 is a Mixture-of-Experts large language model with 32 billion activated parameters and 1 trillion total parameters. It reaches state-of-the-art results among open-source non-thinking models on agentic tasks by pre-training on 15.5 trillion tokens using the MuonClip optimizer with zero loss spikes and then applying a multi-stage post-training process that includes large-scale agentic data synthesis and joint reinforcement learning with environments. This yields strong performance in coding, mathematics, and reasoning without extended thinking.
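The 32B-activated / 1T-total split follows from top-k expert routing: each token touches the shared parameters plus only the selected experts. The sketch below is a back-of-envelope illustration; the expert count, top-k, and shared fraction are assumptions chosen to roughly reproduce the stated figures, not the paper's configuration.

```python
# Back-of-envelope MoE sizing: per token, only the routed experts'
# parameters are active. The expert count, top-k, and shared fraction
# below are illustrative assumptions; the abstract states only the
# 32B-activated / 1T-total figures.
def activated_params(total: float, n_experts: int, top_k: int,
                     shared: float) -> float:
    """Parameters touched per token: shared (attention, embeddings,
    router) plus top_k of n_experts expert blocks."""
    expert_params = total - shared
    return shared + expert_params * top_k / n_experts

# Example: 1e12 total, 1% shared, 384 experts, top-8 routing
act = activated_params(1e12, n_experts=384, top_k=8, shared=0.01e12)
print(f"{act / 1e9:.0f}B activated of 1000B total")  # ~31B, near the stated 32B
```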
What carries the argument
The MuonClip optimizer, which adds a QK-clip technique to Muon for training stability while preserving token efficiency, together with the multi-stage post-training pipeline that combines agentic data synthesis and joint reinforcement learning through real and synthetic environment interactions.
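The QK-clip idea, capping attention logits by rescaling the query and key projections whenever their observed maximum exceeds a threshold, can be sketched as follows. This is a minimal illustration assuming per-head max-logit tracking; the threshold value and the even square-root split of the rescale between W_q and W_k are assumptions for exposition, not the paper's exact recipe (which applies the clip inside the optimizer step).

```python
import numpy as np

# Minimal sketch of a QK-clip-style rescaling. The threshold tau and
# the even sqrt split across W_q and W_k are illustrative choices;
# MuonClip applies this per attention head during optimization.
def qk_clip(W_q: np.ndarray, W_k: np.ndarray,
            s_max: float, tau: float = 100.0):
    """If the observed max attention logit s_max exceeds tau, shrink
    the query/key projections so future logits stay near tau."""
    if s_max > tau:
        gamma = tau / s_max            # overall logit shrink factor
        W_q = W_q * np.sqrt(gamma)     # split evenly: logits are
        W_k = W_k * np.sqrt(gamma)     # bilinear in W_q and W_k
    return W_q, W_k

rng = np.random.default_rng(0)
W_q, W_k = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
W_q2, W_k2 = qk_clip(W_q, W_k, s_max=250.0, tau=100.0)
# Logits q^T k shrink by gamma = 100/250 = 0.4 overall
```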
If this is right
- Open-source models can reach high levels of performance in software engineering and agentic tasks without closed-source resources or extended reasoning.
- The release of both base and post-trained checkpoints allows the community to continue research on agentic intelligence.
- Large-scale pre-training on trillions of tokens can proceed stably using the described optimizer technique.
- Agentic capabilities strengthen when models interact with environments during reinforcement learning stages.
- Strong results in multilingual coding benchmarks follow from the same training process.
Where Pith is reading between the lines
- Releasing the model weights may let independent groups combine or fine-tune Kimi K2 for new agentic applications faster than single-lab development allows.
- If the stable training method works at this scale, it could simplify hyperparameter choices for future trillion-parameter pre-training runs.
- Success on multilingual software benchmarks points to a route for building coding tools that work across languages without separate models for each.
Load-bearing premise
The reported benchmark scores reflect genuine generalizable agentic and coding ability rather than optimization specific to those test sets.
What would settle it
Running the released checkpoints on new, previously unseen agentic and coding benchmarks to check whether performance remains at the claimed leading level among open non-thinking models.
Original abstract
We introduce Kimi K2, a Mixture-of-Experts (MoE) large language model with 32 billion activated parameters and 1 trillion total parameters. We propose the MuonClip optimizer, which improves upon Muon with a novel QK-clip technique to address training instability while enjoying the advanced token efficiency of Muon. Based on MuonClip, K2 was pre-trained on 15.5 trillion tokens with zero loss spike. During post-training, K2 undergoes a multi-stage post-training process, highlighted by a large-scale agentic data synthesis pipeline and a joint reinforcement learning (RL) stage, where the model improves its capabilities through interactions with real and synthetic environments. Kimi K2 achieves state-of-the-art performance among open-source non-thinking models, with strengths in agentic capabilities. Notably, K2 obtains 66.1 on Tau2-Bench, 76.5 on ACEBench (En), 65.8 on SWE-Bench Verified, and 47.3 on SWE-Bench Multilingual -- surpassing most open and closed-sourced baselines in non-thinking settings. It also exhibits strong capabilities in coding, mathematics, and reasoning tasks, with a score of 53.7 on LiveCodeBench v6, 49.5 on AIME 2025, 75.1 on GPQA-Diamond, and 27.1 on OJBench, all without extended thinking. These results position Kimi K2 as one of the most capable open-source large language models to date, particularly in software engineering and agentic tasks. We release our base and post-trained model checkpoints to facilitate future research and applications of agentic intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Kimi K2, a Mixture-of-Experts LLM with 32B activated parameters and 1T total parameters. It describes the MuonClip optimizer (an extension of Muon with QK-clip for stability), pre-training on 15.5 trillion tokens with zero loss spikes, and multi-stage post-training involving large-scale agentic data synthesis and joint RL with real/synthetic environments. The model is reported to achieve SOTA results among open-source non-thinking models on agentic and coding benchmarks (Tau2-Bench 66.1, ACEBench En 76.5, SWE-Bench Verified 65.8, SWE-Bench Multilingual 47.3, LiveCodeBench v6 53.7, AIME 2025 49.5, GPQA-Diamond 75.1, OJBench 27.1) without extended thinking, with both base and post-trained checkpoints released.
Significance. If the empirical results hold under independent verification, the work is significant for releasing a competitive open-source model strong in agentic, software engineering, and reasoning tasks, closing some of the gap with closed models in non-thinking settings. The model release directly enables reproducibility and further research on agentic intelligence. The MuonClip optimizer is presented as a practical contribution for stable large-scale training, though its isolated impact requires more evidence.
Major comments (2)
- [Pre-training description] Pre-training section: The assertion of training on 15.5 trillion tokens with 'zero loss spike' using MuonClip is stated without any loss curves, stability metrics, or ablation comparisons to baseline optimizers (e.g., AdamW or standard Muon). This detail is load-bearing for claims about the optimizer's effectiveness and the overall training narrative, even if final benchmark scores are the primary result.
- [Results and benchmarks] Evaluation and results sections: Reported benchmark scores (e.g., Tau2-Bench 66.1, SWE-Bench Verified 65.8) lack accompanying details on exact evaluation protocols, prompting formats, temperature settings, or error bars from multiple runs. While model release allows verification of the numbers themselves, the absence of these elements weakens assessment of robustness and generalizability versus potential test-set optimization.
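The "zero loss spike" claim in the first comment could be operationalized rather than asserted. One common criterion, sketched here, flags any step whose loss exceeds a running median of recent steps by a fixed margin; the window size and margin are arbitrary illustrative choices, not values from the paper.

```python
# One way to operationalize "zero loss spikes": flag any step whose
# loss exceeds the running median of the preceding window by a fixed
# margin. Window and margin are illustrative, not from the paper.
from statistics import median

def find_spikes(losses, window=50, margin=0.2):
    spikes = []
    for t in range(window, len(losses)):
        baseline = median(losses[t - window:t])
        if losses[t] > baseline + margin:
            spikes.append(t)
    return spikes

smooth = [2.0 - 0.001 * t for t in range(200)]   # smoothly decreasing loss
spiked = smooth.copy()
spiked[120] += 1.5                               # inject a single spike
assert find_spikes(smooth) == []                 # stable run: no spikes
assert find_spikes(spiked) == [120]              # spike caught at step 120
```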
Minor comments (4)
- [Abstract and §1] The abstract and introduction would benefit from a clearer statement of the total vs. activated parameter count and how the MoE architecture is configured (e.g., number of experts, routing details).
- [Post-training] Post-training description mentions 'joint reinforcement learning (RL) stage' but provides no specifics on the RL algorithm, reward model, or environment interaction details, which would aid reproducibility.
- [Tables and figures] Figure and table captions could be expanded to include exact benchmark versions, baselines compared, and whether results are from the base or post-trained model.
- [Discussion or new section] The paper should include a limitations section addressing potential data contamination risks for the reported benchmarks, given the scale of pre-training data.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation of minor revision. We address the two major comments point by point below, agreeing that additional transparency will strengthen the manuscript. We will incorporate the requested details in the revised version.
Point-by-point responses
-
Referee: [Pre-training description] Pre-training section: The assertion of training on 15.5 trillion tokens with 'zero loss spike' using MuonClip is stated without any loss curves, stability metrics, or ablation comparisons to baseline optimizers (e.g., AdamW or standard Muon). This detail is load-bearing for claims about the optimizer's effectiveness and the overall training narrative, even if final benchmark scores are the primary result.
Authors: We agree that the pre-training stability claim would benefit from supporting evidence. Full ablations at 1T-parameter scale are computationally prohibitive and were not performed, but we will add a pre-training loss curve figure in the revised manuscript (or appendix) to demonstrate the absence of spikes across the 15.5T tokens. We will also briefly describe the QK-clip mechanism and its observed effect on gradient norms during development runs to provide context for the optimizer's contribution. revision: yes
-
Referee: [Results and benchmarks] Evaluation and results sections: Reported benchmark scores (e.g., Tau2-Bench 66.1, SWE-Bench Verified 65.8) lack accompanying details on exact evaluation protocols, prompting formats, temperature settings, or error bars from multiple runs. While model release allows verification of the numbers themselves, the absence of these elements weakens assessment of robustness and generalizability versus potential test-set optimization.
Authors: We accept this point and will expand the evaluation section in the revision. We will include a table or subsection specifying prompting formats, temperature (typically 0.0 for deterministic agentic/coding benchmarks), top-p, and other sampling parameters for each reported score. We will also note that the results reflect single runs on standard benchmarks and that the open release of both base and post-trained checkpoints enables independent multi-run verification and statistical analysis by the community. revision: yes
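The error bars the referee asks for can be approximated even from a single evaluation run, given per-instance pass/fail outcomes, via a bootstrap over instances. The sketch below uses a synthetic 500-instance outcome vector matching the reported 65.8% pass rate; the instance count and outcomes are illustrative assumptions, not released data.

```python
import random

# Bootstrap a confidence interval for a benchmark pass rate from
# per-instance pass/fail outcomes of a single evaluation run.
# The 500-instance vector below is synthetic, for illustration only.
def bootstrap_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

outcomes = [1] * 329 + [0] * 171        # synthetic ~65.8% pass rate
lo, hi = bootstrap_ci(outcomes)
print(f"pass rate 65.8%, 95% CI [{lo:.1%}, {hi:.1%}]")
```

With ~500 instances, a binomial pass rate near 66% carries roughly a ±4 percentage-point 95% interval, which is context worth reporting next to single-run scores.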
Circularity Check
No significant circularity; empirical claims only
Full rationale
The paper's central claims are direct empirical benchmark scores (e.g., 66.1 on Tau2-Bench, 65.8 on SWE-Bench Verified) obtained by running the released model checkpoints under standard evaluation protocols. The MuonClip optimizer is introduced as a practical training technique with a QK-clip modification, but no derivation, equation, or prediction reduces by construction to fitted parameters, self-referential normalizations, or prior self-citations. Training statements such as 'zero loss spike' on 15.5 trillion tokens are factual process descriptions, not outputs derived from the model's own equations or ansatzes. No uniqueness theorems, load-bearing self-citations, or renamed known results are invoked to support the performance claims, which remain independently verifiable by third parties using the released weights.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 60 Pith papers
-
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
-
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
-
MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs
MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.
-
When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
-
LLM Translation of Compiler Intermediate Representation
IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.
-
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
-
ODUTQA-MDC: A Task for Open-Domain Underspecified Tabular QA with Multi-turn Dialogue-based Clarification
Introduces the ODUTQA-MDC task with a 25k-pair benchmark and MAIC-TQA multi-agent framework for detecting and clarifying underspecified open-domain tabular questions via dialogue.
-
GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction
A genome-conditioned 4B LLM agent predicts microbial life boundaries and matches larger frontier models via token fusion, tool use, and a counterfactual gene-grounding reward.
-
LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving
LeanSearch v2 recovers 46.1% of ground-truth premise groups for research-level Lean 4 theorems within 10 candidates and raises fixed-loop proof success to 20%.
-
LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving
LeanSearch v2 recovers 46.1% of ground-truth premise groups on research-level Mathlib theorems and raises fixed-loop proof success from 4% to 20% via embedding-reranker plus iterative sketch-retrieve-reflect retrieval.
-
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...
-
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
-
Beyond Position Bias: Shifting Context Compression from Position-Driven to Semantic-Driven
SeCo performs semantic-driven context compression for LLMs by anchoring on query-relevant semantic centers and applying consistency-weighted token merging, yielding better downstream performance, lower latency, and st...
-
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
-
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and through...
-
TrajShield: Trajectory-Level Safety Mediation for Defending Text-to-Video Models Against Jailbreak Attacks
TrajShield is a training-free defense that reduces jailbreak success rates by 52.44% on average in text-to-video models by localizing and neutralizing risks through trajectory simulation and causal intervention.
-
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice
OralMLLM-Bench is a new benchmark with 27 tasks in four cognitive categories that evaluates six MLLMs on dental radiographs and shows clear performance gaps versus clinicians.
-
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice
OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.
-
Improving Vision-language Models with Perception-centric Process Reward Models
Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
-
OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve pe...
-
FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.
-
GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
GTA-2 benchmark shows frontier models achieve below 50% on atomic tool tasks and only 14.39% success on realistic long-horizon workflows, with execution harnesses like Manus providing substantial gains.
-
TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models
TrigReason matches large reasoning model accuracy on math and science benchmarks by delegating most steps to small models and intervening selectively on three triggers, cutting latency by 43.9% and cost by 73.3%.
-
AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning
A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.
-
E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning
E2E-REME outperforms nine LLMs in accuracy and efficiency for end-to-end microservice remediation by using experience-simulation reinforcement fine-tuning on a new benchmark called MicroRemed.
-
Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
-
Beyond Compliance: A Resistance-Informed Motivation Reasoning Framework for Challenging Psychological Client Simulation
ResistClient creates more realistic challenging client simulators by combining resistance theory with supervised fine-tuning on a new dataset followed by process-supervised reinforcement learning for motivation reasoning.
-
Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards
A mid-sized LLM buyer trained with RL from verifiable economic rewards learns sophisticated negotiation tactics and extracts more surplus than frontier models over 10x larger.
-
SAGE: A Service Agent Graph-guided Evaluation Benchmark
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...
-
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
-
Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
OmniBehavior benchmark demonstrates that LLMs simulating real human behavior converge on hyper-active positive average personas, losing long-tail individual differences.
-
Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
Plan-RewardBench is a trajectory-level preference benchmark that evaluates how well reward models distinguish preferred agent trajectories from hard distractors across safety refusal, tool handling, complex planning, ...
-
Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
Plan-RewardBench is a trajectory-level preference benchmark that shows existing reward models, including LLM judges, perform poorly on long-horizon agent trajectories in tool-using scenarios across safety, planning, a...
-
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...
-
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
-
BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs
BOSCH decomposes attention-head selection for short-context hybridization into layer probing, adaptive ratio assignment, and grouped binary optimization, yielding better efficiency-performance tradeoffs than static or...
-
DeonticBench: A Benchmark for Reasoning over Rules
DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
-
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
-
Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models
DEMASK adds a lightweight pairwise-dependency predictor to dLLMs and uses greedy selection to enable parallel unmasking whose total-variation error is provably bounded under sub-additivity.
-
MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese
Math-PT provides 1,729 native Portuguese math problems and shows frontier LLMs perform well on multiple-choice but drop on figures and open-ended items.
-
Think Anywhere in Code Generation
Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.
-
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
-
Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax
Reinforcement learning with semantic rewards lets LLMs gain low-resource language skills without the alignment tax that degrades general capabilities in supervised fine-tuning.
-
MinT: Managed Infrastructure for Training and Serving Millions of LLMs
MinT enables efficient management of million-scale LoRA-adapted LLM policies over shared 1T-parameter base models by moving only small adapters through training and serving pipelines.
-
EMO: Frustratingly Easy Progressive Training of Extendable MoE
EMO progressively expands the expert pool in MoE models during training to match fixed-expert performance with improved wall-clock efficiency.
-
Context Training with Active Information Seeking
Adding active search tools to LLM context optimization works only when combined with a multi-candidate search-based training procedure that prunes contexts, delivering gains across low-resource translation, health, an...
-
MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning
MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...
-
Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies and derives a ranking loss to align DLM training with observed trajectories, yielding gains in new domains and reduced...
-
Selective Off-Policy Reference Tuning with Plan Guidance
SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
-
Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities
A new benchmark uses separate predictor and scorer LLMs to test whether forecast strings improve likelihood of hidden mathematical equation continuations, with controls that detect priming shortcuts.
-
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
-
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
-
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
-
SecureForge: Finding and Preventing Vulnerabilities in LLM-Generated Code via Prompt Optimization
SecureForge audits LLM code for vulnerabilities, builds a synthetic prompt corpus via Markovian sampling, and optimizes system prompts to cut security issues by up to 48% while preserving unit test performance, with z...
-
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training lo...
-
WebTrap: Stealthy Mid-Task Hijacking of Browser Agents During Navigation
WebTrap uses multi-step instruction fusion and context-grounded generation to stealthily hijack browser agents mid-navigation while preserving original task success.
-
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
-
Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment
Degraded image resolution in MLLMs bypasses safety alignments via cognitive overload, raising jailbreak rates across perturbations.
Reference graph
Works this paper leans on
-
[1]
Jacob Austin et al.Program Synthesis with Large Language Models. 2021. arXiv: 2108.07732 [cs.PL]. URL:https://arxiv.org/abs/2108.07732
work page internal anchor Pith review Pith/arXiv arXiv 2021
- [2]
-
[3]
$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment
Victor Barres et al. τ2-Bench: Evaluating Conversational Agents in a Dual-Control Environment. 2025. arXiv: 2506.07982 [cs.AI].URL:https://arxiv.org/abs/2506.07982
work page internal anchor Pith review arXiv 2025
-
[4]
Stella Biderman et al. “Lessons from the trenches on reproducible evaluation of language models”. In:arXiv preprint arXiv:2405.14782(2024)
-
[5]
Greg Brockman et al.OpenAI Gym. 2016. arXiv: 1606.01540 [cs.LG].URL: https://arxiv.org/ abs/1606.01540
work page internal anchor Pith review arXiv 2016
-
[6]
MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Gen- eration
Federico Cassano et al. “MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Gen- eration”. In:IEEE Transactions on Software Engineering49.7 (2023), pp. 3675–3691.DOI: 10.1109/TSE. 2023.3267446
work page doi:10.1109/tse 2023
-
[7]
ACEBench: Who Wins the Match Point in Tool Learning?
Chen Chen et al. “ACEBench: Who Wins the Match Point in Tool Learning?” In:arXiv e-prints(2025), arXiv– 2501
work page 2025
-
[8]
Evaluating Large Language Models Trained on Code
Mark Chen et al. “Evaluating Large Language Models Trained on Code”. In: (2021). arXiv: 2107.03374 [cs.LG]
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[9]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark et al. “Think you have solved question answering? try arc, the ai2 reasoning challenge”. In:arXiv preprint arXiv:1803.05457(2018)
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[10]
Karl Cobbe et al.Training Verifiers to Solve Math Word Problems. 2021. arXiv: 2110.14168 [cs.LG].URL: https://arxiv.org/abs/2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[11]
DeepSeek-AI.DeepSeek-V3 Technical Report. 2024. arXiv: 2412 . 19437 [cs.CL].URL: https : / / arxiv.org/abs/2412.19437
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
Scaling vision transformers to 22 billion parameters
Mostafa Dehghani et al. “Scaling vision transformers to 22 billion parameters”. In:International conference on machine learning. PMLR. 2023, pp. 7480–7512
work page 2023
-
[13]
Guanting Dong et al.Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models. 2024. arXiv: 2406 . 13542 [cs.CL].URL: https : / / arxiv . org / abs / 2406 . 13542
work page 2024
-
[14]
Xinrun Du et al. “Supergpqa: Scaling llm evaluation across 285 graduate disciplines”. In:arXiv preprint arXiv:2502.14739(2025)
-
[15]
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
Dheeru Dua et al. “DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Para- graphs”. In:CoRRabs/1903.00161 (2019). arXiv: 1903.00161.URL: http://arxiv.org/abs/1903. 00161
work page Pith review arXiv 1903
- [16]
-
[17]
Paul Gauthier.Aider LLM Leaderboards.https://aider.chat/docs/leaderboards/. 2025
work page 2025
-
[18]
Are we done with mmlu? CoRR, abs/2406.04127,
Aryo Pradipta Gema et al. “Are we done with mmlu?” In:arXiv preprint arXiv:2406.04127(2024)
-
[19]
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Alex Gu et al. “Cruxeval: A benchmark for code reasoning, understanding and execution”. In:arXiv preprint arXiv:2401.03065(2024)
work page internal anchor Pith review arXiv 2024
-
[20]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo et al. “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning”. In:arXiv preprint arXiv:2501.12948(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
Zhicheng Guo et al. “StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models”. In:arXiv preprint arXiv:2403.07714(2025)
- [22] Aaron Harlap et al. "PipeDream: Fast and Efficient Pipeline Parallel DNN Training". In: arXiv preprint arXiv:1806.03377 (2018).
- [23] Y. He et al. "Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models". In: arXiv preprint arXiv:2411.07140 (2024). URL: https://arxiv.org/abs/2411.07140.
- [24] Dan Hendrycks et al. "Measuring Massive Multitask Language Understanding". In: arXiv preprint arXiv:2009.03300 (2020).
- [25] Dan Hendrycks et al. Measuring Mathematical Problem Solving With the MATH Dataset. 2021. arXiv: 2103.03874 [cs.LG]. URL: https://arxiv.org/abs/2103.03874.
- [26] Shengding Hu et al. "MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies". In: arXiv preprint arXiv:2404.06395 (2024).
- [27]
- [28]
- [29] Yanping Huang et al. "GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism". In: Advances in Neural Information Processing Systems 32 (2019).
- [30] Yuzhen Huang et al. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. 2023.
- [31]
- [32]
- [33] Naman Jain et al. "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code". In: arXiv preprint arXiv:2403.07974 (2024).
- [34] Carlos E. Jimenez et al. "SWE-bench: Can Language Models Resolve Real-world GitHub Issues?" In: The Twelfth International Conference on Learning Representations. 2024. URL: https://openreview.net/forum?id=VTF8yNQM66.
- [35] Keller Jordan et al. Muon: An Optimizer for Hidden Layers in Neural Networks. 2024. URL: https://kellerjordan.github.io/posts/muon/.
- [36] Mandar Joshi et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. 2017. arXiv: 1705.03551 [cs.CL]. URL: https://arxiv.org/abs/1705.03551.
- [37]
- [38] Kimi Team. "Kimi k1.5: Scaling Reinforcement Learning with LLMs". In: arXiv preprint arXiv:2501.12599 (2025).
- [39] Diederik P. Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization". In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015. URL: http://arxiv.org/abs/1412.6980.
- [40] Satyapriya Krishna et al. Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation. 2024.
- [41]
- [42] Joel Lamy-Poirier. "Breadth-First Pipeline Parallelism". In: Proceedings of Machine Learning and Systems 5 (2023), pp. 48–67.
- [43] Dmitry Lepikhin et al. "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding". In: arXiv preprint arXiv:2006.16668 (2020).
- [44]
- [45] Jia Li et al. "NuminaMath: The Largest Public Dataset in AI4Maths with 860k Pairs of Competition Math Problems and Solutions". In: Hugging Face repository 13.9 (2024), p. 9.
- [46] Tianle Li et al. "From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline". In: arXiv preprint arXiv:2406.11939 (2024).
- [47]
- [48] Aixin Liu et al. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model". In: arXiv preprint arXiv:2405.04434 (2024).
- [49] Jiawei Liu et al. "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation". In: Advances in Neural Information Processing Systems 36 (2023), pp. 21558–21572.
- [50] Jingyuan Liu et al. "Muon is Scalable for LLM Training". In: arXiv preprint arXiv:2502.16982 (2025).
- [51] Ziming Liu et al. "Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency". In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC '23. ACM, Nov. 2023, pp. 1–13. DOI: 10.1145/3581784.3607073. URL: http://dx.doi.org/10.1145/3581784.3607073.
- [52] Ilya Loshchilov and Frank Hutter. "Decoupled Weight Decay Regularization". In: International Conference on Learning Representations. 2019. URL: https://openreview.net/forum?id=Bkg6RiCqY7.
- [53]
- [54] Samuel Miserendino et al. "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?" In: arXiv preprint arXiv:2502.12115 (2025).
- [55] Arindam Mitra et al. "AgentInstruct: Toward Generative Teaching with Agentic Flows". In: arXiv preprint arXiv:2407.03502 (2024).
- [56] Ivan Moshkov et al. "AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning Dataset". In: arXiv preprint arXiv:2504.16891 (2025).
- [57] Deepak Narayanan et al. "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM". In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2021, pp. 1–15.
- [58] Long Ouyang et al. "Training Language Models to Follow Instructions with Human Feedback". In: Advances in Neural Information Processing Systems 35 (2022), pp. 27730–27744.
- [59] Bowen Peng et al. "YaRN: Efficient Context Window Extension of Large Language Models". In: arXiv preprint arXiv:2309.00071 (2023).
- [60] Long Phan et al. Humanity's Last Exam. 2025. arXiv: 2501.14249 [cs.LG]. URL: https://arxiv.org/abs/2501.14249.
- [61] Penghui Qi et al. "Zero Bubble Pipeline Parallelism". In: arXiv preprint arXiv:2401.10241 (2023).
- [62] Yujia Qin et al. "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs". In: arXiv preprint arXiv:2307.16789 (2023).
- [63] Qwen et al. Qwen2.5 Technical Report. 2025. arXiv: 2412.15115 [cs.CL]. URL: https://arxiv.org/abs/2412.15115.
- [64] Samyam Rajbhandari et al. "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models". In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE. 2020, pp. 1–16.
- [65] David Rein et al. "GPQA: A Graduate-Level Google-Proof Q&A Benchmark". In: First Conference on Language Modeling. 2024.
- [66] Keisuke Sakaguchi et al. "WinoGrande: An Adversarial Winograd Schema Challenge at Scale". In: Communications of the ACM 64.9 (2021), pp. 99–106.
- [67] David Silver and Richard S. Sutton. "Welcome to the Era of Experience". In: Google AI 1 (2025).
- [68] Ved Sirdeshmukh et al. MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs. 2025. arXiv: 2501.17399 [cs.CL]. URL: https://arxiv.org/abs/2501.17399.
- [69] Giulio Starace et al. "PaperBench: Evaluating AI's Ability to Replicate AI Research". In: arXiv preprint arXiv:2504.01848 (2025).
- [70]
- [71] Mirac Suzgun et al. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. 2022. arXiv: 2210.09261 [cs.CL]. URL: https://arxiv.org/abs/2210.09261.
- [72] Manveer Singh Tamber et al. "Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards". In: arXiv preprint arXiv:2505.04847 (2025).
- [73] Gemma Team et al. "Gemma 2: Improving Open Language Models at a Practical Size". In: arXiv preprint arXiv:2408.00118 (2024).
- [74] Llama Team. The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI Innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/. [Accessed 15-07-2025].
- [75] The Terminal-Bench Team. Terminal-Bench: A Benchmark for AI Agents in Terminal Environments. Apr. 2025. URL: https://github.com/laude-institute/terminal-bench.
- [76] Ashish Vaswani et al. "Attention is All You Need". In: Advances in Neural Information Processing Systems. Ed. by I. Guyon et al. Vol. 30. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
- [77] Vectara. Hallucination Evaluation Model (Revision 7437011). 2024. URL: https://huggingface.co/vectara/hallucination_evaluation_model.
- [78] Joshua Vendrow et al. "Do Large Language Model Benchmarks Test Reliability?" In: arXiv preprint arXiv:2502.03461 (2025).
- [79] Yizhong Wang et al. "Self-Instruct: Aligning Language Models with Self-Generated Instructions". In: arXiv preprint arXiv:2212.10560 (2022).
- [80] Yubo Wang et al. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. 2024.