Kimi k1.5: Scaling Reinforcement Learning with LLMs
Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3
The pith
Scaling reinforcement learning with long context and policy optimization lets LLMs match top reasoning performance on math and code benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a simple RL framework for multi-modal LLMs, built on long-context scaling and improved policy optimization without Monte Carlo tree search, value functions, or process reward models, achieves state-of-the-art reasoning performance across benchmarks and modalities, with scores of 77.5 on AIME, 96.2 on MATH-500, the 94th percentile on Codeforces, and 74.9 on MathVista, matching OpenAI's o1. The framework also includes effective long2short methods that transfer gains from long-CoT training to short-CoT models, yielding 60.8 on AIME, 94.6 on MATH-500, and 47.3 on LiveCodeBench.
What carries the argument
The RL training framework that scales long context and applies improved policy optimization to let models learn from rewards on extended sequences.
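To make that mechanism concrete, here is a minimal, self-contained sketch of the recipe as the review describes it: sample a group of responses per prompt, score each with an outcome reward, and update the policy with group-mean-centered advantages, so no value network, tree search, or process reward model appears. The toy softmax policy stands in for an LLM; the action set, reward table, and learning rate are illustrative assumptions, not the paper's actual objective or code.

```python
# Hedged sketch: outcome-reward policy optimization with a group-relative
# baseline instead of a learned value model. A toy softmax bandit stands
# in for the LLM policy; this is an illustration of the idea, not the
# authors' training objective.
import math
import random

random.seed(0)

ACTIONS = ["short_wrong", "long_correct", "short_correct"]   # stand-ins for sampled responses
REWARD = {"short_wrong": 0.0, "long_correct": 1.0, "short_correct": 1.0}  # outcome reward only
theta = {a: 0.0 for a in ACTIONS}                            # policy logits

def softmax_probs():
    z = {a: math.exp(theta[a]) for a in ACTIONS}
    s = sum(z.values())
    return {a: z[a] / s for a in ACTIONS}

def sample_group(k=8):
    probs = softmax_probs()
    return random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS], k=k)

for step in range(200):
    group = sample_group()
    rewards = [REWARD[a] for a in group]
    baseline = sum(rewards) / len(rewards)   # group mean replaces a value model
    probs = softmax_probs()
    for a, r in zip(group, rewards):
        adv = r - baseline
        # REINFORCE gradient of log softmax w.r.t. each logit: adv * (1[a'=a] - pi(a'))
        for a2 in ACTIONS:
            grad = adv * ((1.0 if a2 == a else 0.0) - probs[a2])
            theta[a2] += 0.1 * grad

print(softmax_probs())   # probability mass should concentrate on the correct actions
```

The load-bearing design choice in the sketch is the baseline: centering rewards on the group mean supplies the variance reduction a learned value function would otherwise provide, which is what lets the framework stay simple.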
If this is right
- Reinforcement learning can serve as a primary scaling method for reasoning once infrastructure supports it.
- Multi-modal data combined with RL improves performance on both text and vision reasoning tasks.
- Long-CoT training can be distilled into stronger short-CoT models that run at lower inference cost (a sketch of one such route follows this list).
- Simple policy optimization suffices for competitive results, removing the need for tree search or separate value models.
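As referenced in the distillation bullet above, one plausible long2short instantiation is shortest rejection sampling: sample several long-CoT responses per prompt, keep the shortest one that verifies as correct, and fine-tune a short-CoT model on those pairs. This is a hedged sketch under that assumption; `generate` and `is_correct` are hypothetical hooks, not the paper's API.

```python
# Hedged long2short sketch: build an SFT set from the shortest verified-
# correct long-CoT sample per prompt. `generate` and `is_correct` are
# hypothetical stand-ins for a long-CoT sampler and an answer verifier.
from typing import Callable, Optional

def shortest_correct(prompt: str,
                     generate: Callable[[str], str],
                     is_correct: Callable[[str, str], bool],
                     n_samples: int = 8) -> Optional[str]:
    """Return the shortest verified-correct sample, or None if all fail."""
    correct = [y for y in (generate(prompt) for _ in range(n_samples))
               if is_correct(prompt, y)]
    return min(correct, key=len) if correct else None

def build_short_cot_sft_set(prompts, generate, is_correct):
    """Distillation set: (prompt, shortest correct response) pairs that a
    short-CoT model can be fine-tuned on at lower inference cost."""
    data = []
    for p in prompts:
        y = shortest_correct(p, generate, is_correct)
        if y is not None:
            data.append((p, y))
    return data
```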
Where Pith is reading between the lines
- If the framework generalizes, RL compute could replace or complement data scaling as the main driver of capability growth in reasoning domains.
- The long-to-short transfer suggests a practical way to improve deployed models without changing their inference length.
- Testing the same methods on non-reasoning tasks such as tool use or planning would clarify the scope of the gains.
- Infrastructure optimizations for long-context RL may become a key bottleneck if the approach is widely adopted.
Load-bearing premise
The reported benchmark gains come primarily from the described RL techniques rather than from undisclosed differences in model scale, data quality, or evaluation protocols.
What would settle it
A controlled reproduction that applies the exact long-context RL and policy optimization methods to a public model and fails to reach the stated benchmark thresholds would show the gains do not generalize from the reported runs.
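A trivial harness for that falsification test might look like the following; the thresholds are the paper's reported long-CoT scores, while the tolerance is an assumed allowance for run-to-run noise, not something the source specifies.

```python
# Sketch of the falsification check described above: compare a controlled
# reproduction's scores against the reported thresholds. Tolerance is an
# assumption, not from the source.
REPORTED = {"AIME": 77.5, "MATH-500": 96.2, "MathVista": 74.9}

def reproduction_supports_claim(measured: dict, tolerance: float = 2.0) -> bool:
    """True if every reproduced score lands within `tolerance` points of the
    reported value; a clear miss under the same protocol would indicate the
    gains do not transfer from the reported runs."""
    return all(measured.get(k, float("-inf")) >= v - tolerance
               for k, v in REPORTED.items())
```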
Original abstract
Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94-th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Kimi k1.5, a multi-modal LLM trained via reinforcement learning. It identifies long-context scaling and improved policy optimization as the core of a simple RL framework that avoids MCTS, value functions, and process reward models. The work reports state-of-the-art reasoning results matching o1 (77.5 AIME, 96.2 MATH-500, 94th percentile Codeforces, 74.9 MathVista) and introduces long2short distillation techniques that yield strong short-CoT performance (60.8 AIME, 94.6 MATH-500, 47.3 LiveCodeBench), outperforming GPT-4o and Claude 3.5 Sonnet by large margins.
Significance. If the performance gains are causally attributable to the described RL ingredients rather than scale or data differences, the result would be significant: it would demonstrate that competitive reasoning can be obtained from a comparatively simple RL recipe, supporting the broader thesis that RL provides a new scaling axis beyond next-token prediction. The long2short method would additionally offer a practical route to efficient short-CoT models.
major comments (3)
- [Abstract] Abstract and methods description: the central claim that long-context scaling plus improved policy optimization constitutes a 'simplistic, effective RL framework' responsible for o1-level performance cannot be evaluated, because the manuscript supplies no information on base-model parameter count, total RL tokens or steps, base-model identity, or training data composition. Without these quantities, attribution of the reported scores (77.5 AIME, 96.2 MATH-500, etc.) to the stated techniques versus undisclosed scale or data advantages is impossible.
- [Abstract] Abstract and results sections: no ablation studies or controlled comparisons are presented that isolate the contribution of the long-context scaling and policy-optimization changes from other factors (e.g., data quality, sampling budget, or post-training tricks). The absence of such experiments leaves the 'simple framework' conclusion untestable.
- [Abstract] Evaluation description: benchmark numbers are given without specification of the evaluation protocol (temperature, number of samples, few-shot prompts, or whether results are single-run or averaged), which is required for reproducible comparison to o1 and other models.
minor comments (1)
- [Abstract] The '+550%' improvement claim in the abstract should explicitly identify the baseline model and metric to avoid ambiguity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point by point below, offering the strongest honest defense of the manuscript while acknowledging its limitations. We commit to revisions for improved clarity and reproducibility where possible.
Point-by-point responses
- Referee: [Abstract] Abstract and methods description: the central claim that long-context scaling plus improved policy optimization constitutes a 'simplistic, effective RL framework' responsible for o1-level performance cannot be evaluated, because the manuscript supplies no information on base-model parameter count, total RL tokens or steps, base-model identity, or training data composition. Without these quantities, attribution of the reported scores to the stated techniques versus undisclosed scale or data advantages is impossible.
  Authors: We agree that full disclosure of base-model size, exact RL token counts, steps, and data composition would allow stronger causal attribution. However, these details are proprietary and cannot be released. The manuscript positions the contribution as demonstrating that long-context scaling combined with improved policy optimization yields o1-level results without MCTS, value functions, or process reward models. The base model is a continuation of the prior Kimi series. We will add an explicit statement noting that scale and data details are withheld for competitive reasons, while emphasizing the framework's simplicity as evidenced by the achieved performance.
  Revision: partial
- Referee: [Abstract] Abstract and results sections: no ablation studies or controlled comparisons are presented that isolate the contribution of the long-context scaling and policy-optimization changes from other factors (e.g., data quality, sampling budget, or post-training tricks). The absence of such experiments leaves the 'simple framework' conclusion untestable.
  Authors: We acknowledge the absence of explicit ablations isolating long-context scaling and policy optimization from data quality or other factors. At the scale of these training runs, controlled ablations are computationally prohibitive and were not performed. The paper reports the end-to-end results of the full system and highlights that competitive performance is obtained without complex auxiliary components. We will expand the discussion to explain the practical constraints on ablations and note that the framework's effectiveness is supported by the overall benchmark outcomes.
  Revision: partial
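For reference, the ablation the referee asks for could in principle be specified as a small factorial grid like the sketch below; the arm names are illustrative assumptions, and the point is only that such a design would isolate the two claimed ingredients, not that it is feasible at the scale the authors describe.

```python
# Hypothetical 2x2 ablation grid over the two claimed ingredients. Each
# arm would require its own (prohibitively expensive) training run, as
# the authors note in their response.
from itertools import product

ARMS = [
    {"long_context_scaling": lc, "improved_policy_opt": po}
    for lc, po in product([True, False], repeat=2)
]
for arm in ARMS:
    print(arm)  # full system, two single-ingredient arms, and the baseline
```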
- Referee: [Abstract] Evaluation description: benchmark numbers are given without specification of the evaluation protocol (temperature, number of samples, few-shot prompts, or whether results are single-run or averaged), which is required for reproducible comparison to o1 and other models.
  Authors: This observation is correct and we will address it. We will revise the evaluation section to specify the protocol, including temperature settings (typically 0 for deterministic inference on reasoning benchmarks), sampling details if used, standard few-shot prompts from each benchmark, and confirmation of whether reported numbers reflect single runs or averages.
  Revision: yes
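For concreteness, the disclosure the authors commit to could be captured in a small protocol record like the sketch below; the field values shown are illustrative assumptions, not confirmed settings from the paper.

```python
# Sketch of a per-benchmark evaluation-protocol record pinning decoding
# settings and run accounting for reproducibility. Values are assumed
# for illustration, not the paper's confirmed protocol.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalProtocol:
    benchmark: str
    temperature: float      # e.g., 0.0 for deterministic decoding
    n_samples: int          # samples drawn per problem
    n_shots: int            # few-shot examples in the prompt
    aggregation: str        # "single_run", "mean", or "pass@k"

PROTOCOLS = [
    EvalProtocol("AIME", temperature=0.0, n_samples=1, n_shots=0,
                 aggregation="single_run"),
    EvalProtocol("MATH-500", temperature=0.0, n_samples=1, n_shots=0,
                 aggregation="single_run"),
]
```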
Not provided in revision
- Exact base-model parameter count, total RL tokens or steps, base-model identity specifics, and training data composition (proprietary)
- New ablation studies or controlled experiments isolating individual contributions (computationally prohibitive at this scale)
Circularity Check
No derivation chain or equations present; empirical claims only
Full rationale
The manuscript is a technical report on RL training practices, infrastructure, and benchmark results for Kimi k1.5. It contains no equations, first-principles derivations, fitted parameters presented as predictions, or load-bearing self-citations of uniqueness theorems. Claims of SOTA performance (e.g., 77.5 AIME) are attributed to long-context scaling and policy optimization but are not derived mathematically from prior steps within the paper; they are reported outcomes. No self-definitional loops, ansatz smuggling, or renaming of known results occur because no formal derivation exists to inspect. The paper is self-contained as an empirical description against external benchmarks.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 60 Pith papers
- ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
  ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
- Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
  RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-train...
- FIND: Toward Multimodal Financial Reasoning and Question Answering for Indic Languages
  FinVQA is a new multilingual benchmark for Indic financial VQA with three difficulty levels and four formats, paired with the FIND framework for faithful numerical reasoning via fine-tuning and constrained decoding.
- AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
  AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.
- Unsupervised Process Reward Models
  Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
- LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models
  LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...
- Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
  RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
- BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
  BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
- KL for a KL: On-Policy Distillation with Control Variate Baseline
  vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
- Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
  The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
- Long Context Pre-Training with Lighthouse Attention
  Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...
- Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
  Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
- Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
  Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...
- Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
  UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.
- Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling
  Unstructured pruning augments test-time scaling reasoning performance in LLMs and can outperform the unpruned model on benchmarks, contrary to expectations from structured pruning studies.
- Stabilizing Efficient Reasoning with Step-Level Advantage Selection
  SAS stabilizes efficient LLM reasoning by step-level advantage masking, improving Pass@1 accuracy by 0.86 points and cutting reasoning length by 16.3% versus length-aware baselines.
- Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
  CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
- AI Achieves a Perfect LSAT Score
  Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.
- Asymmetric Advantage Modulation Calibrates Entropy Dynamics in RLVR
  AsymGRPO refines policy entropy in RLVR by preserving informative entropy on positive rollouts and suppressing spurious entropy on negative ones, outperforming baselines.
- User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation
  SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.
- DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
  DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
- Group-in-Group Policy Optimization for LLM Agent Training
  GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
- Video-R1: Reinforcing Video Reasoning in MLLMs
  Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
  o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
- Teacher-Guided Policy Optimization for LLM Distillation
  TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.
- STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes
  STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.
- Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
  Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
- Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
  BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
- Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
  Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
- Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
  OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
- Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning
  ProFIL trains an activation probe on a frozen base model to zero advantages on theatrical post-commitment rollouts in GRPO, cutting theater 11-100%, raising faithful fractions, and shortening chains 4-19% without accu...
- PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning
  PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.
- Beyond Thinking: Imagining in 360° for Humanoid Visual Search
  Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
- Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
  OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
- MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
  MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
- Hint Tuning: Less Data Makes Better Reasoners
  Hint Tuning uses an instruct model as a difficulty probe to create 1K multi-level hint examples that train reasoning models to calibrate chain-of-thought length, cutting tokens by 31.5% on average across 4B-32B models...
- AIPO: Learning to Reason from Active Interaction
  AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
- HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
  HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
- Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training
  ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and know...
- Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs
  ASTOR improves a single code LLM across four tasks by 9.0-9.5% over the best specialist and 7.5-12.8% over prior multi-task RL baselines via utility-driven data scheduling and adaptive KL regularization.
- Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
  Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
- T²PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
  T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.
- MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
  MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
- Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling
  LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.
- DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training
  DORA's multi-version streaming rollout enables 2-3x higher throughput in asynchronous RL for LLMs while preserving convergence by maintaining policy consistency, data integrity, and bounded staleness.
- Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models
  RL generalizes better than SFT by preserving and slowly evolving a compact set of task-agnostic features from the base model rather than introducing many specialized ones.
- ViPO: Visual Preference Optimization at Scale
  Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.
- See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
  ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
- SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
  Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.
- SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models
  SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
- WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning
  WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much lar...
- Reasoning Structure Matters for Safety Alignment of Reasoning Models
  Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
- One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
  Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
- Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
  Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
- HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
  HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
- Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning
  Step-GRPO internalizes dynamic early exit into reasoning models via step-structured optimization, Dynamic Truncated Rollout, and Step-Aware Relative Reward, delivering 32% token reduction on Qwen3-8B with no accuracy loss.
- LLMs Corrupt Your Documents When You Delegate
  LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
- ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
  ReSS uses decision-tree scaffolds to fine-tune LLMs for faithful tabular reasoning, reporting up to 10% gains over baselines on medical and financial data.
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.