{"work":{"id":"bff96ab1-bd6a-4585-be23-74fdb51969c7","openalex_id":null,"doi":null,"arxiv_id":"2501.12599","raw_key":null,"title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs","authors":null,"authors_text":"Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen","year":2025,"venue":"cs.AI","abstract":"Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94-th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).","external_url":"https://arxiv.org/abs/2501.12599","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-14T20:42:58.847497+00:00","pith_arxiv_id":"2501.12599","created_at":"2026-05-08T17:13:38.774581+00:00","updated_at":"2026-05-14T20:42:58.847497+00:00","title_quality_ok":true,"display_title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs","render_title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs"},"hub":{"state":{"work_id":"bff96ab1-bd6a-4585-be23-74fdb51969c7","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":97,"external_cited_by_count":null,"distinct_field_count":10,"first_pith_cited_at":"2024-12-30T18:55:12+00:00","last_pith_cited_at":"2026-05-13T15:05:18+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-14T21:56:14.499807+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":1}],"polarity_counts":[{"context_polarity":"background","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T06:17:21.052764+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":53},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":49},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":38},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":36},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":30},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":29},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":20},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":18},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":17},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":17},{"title":"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters","work_id":"a8d50b24-bdf5-46ed-bc4f-2927dfd81f1d","shared_citers":17},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":15},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":15},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":14},{"title":"Understanding R1-Zero-Like Training: A Critical Perspective","work_id":"ec354f3b-9484-4a0c-94c8-92d4d0260835","shared_citers":14},{"title":"Tulu 3: Pushing Frontiers in Open Language Model Post-Training","work_id":"28c9dbea-056a-48c2-8000-85f809827e45","shared_citers":12},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":11},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":10},{"title":"Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model","work_id":"763e0e44-40dd-4bdd-8414-21f8f9ce6d10","shared_citers":10},{"title":"Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning","work_id":"0e0b7549-2bc4-4574-aa7f-588ffa16eaae","shared_citers":10},{"title":"VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks","work_id":"c2351652-65f7-47cd-ae80-dbcd72a6eb20","shared_citers":10},{"title":"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?","work_id":"d854765a-e664-41c0-8655-21c4bf2e0cc4","shared_citers":9},{"title":"HybridFlow: A Flexible and Efficient RLHF Framework","work_id":"7eb9c9f4-b322-4bba-8011-09ff8d6ad801","shared_citers":9},{"title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","work_id":"ea9e51ce-1e75-4182-92d8-4d25f70d2ee4","shared_citers":9}],"time_series":[{"n":1,"year":2024},{"n":17,"year":2025},{"n":74,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T06:27:20.648471+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T06:17:16.997895+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs","claims":[{"claim_text":"Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Kimi k1.5: Scaling Reinforcement Learning with LLMs because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T06:27:08.419168+00:00"}},"summary":{"title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs","claims":[{"claim_text":"Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Kimi k1.5: Scaling Reinforcement Learning with LLMs because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":53},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":49},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":38},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":36},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":30},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":29},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":20},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":18},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":17},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":17},{"title":"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters","work_id":"a8d50b24-bdf5-46ed-bc4f-2927dfd81f1d","shared_citers":17},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":15},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":15},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":14},{"title":"Understanding R1-Zero-Like Training: A Critical Perspective","work_id":"ec354f3b-9484-4a0c-94c8-92d4d0260835","shared_citers":14},{"title":"Tulu 3: Pushing Frontiers in Open Language Model Post-Training","work_id":"28c9dbea-056a-48c2-8000-85f809827e45","shared_citers":12},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":11},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":10},{"title":"Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model","work_id":"763e0e44-40dd-4bdd-8414-21f8f9ce6d10","shared_citers":10},{"title":"Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning","work_id":"0e0b7549-2bc4-4574-aa7f-588ffa16eaae","shared_citers":10},{"title":"VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks","work_id":"c2351652-65f7-47cd-ae80-dbcd72a6eb20","shared_citers":10},{"title":"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?","work_id":"d854765a-e664-41c0-8655-21c4bf2e0cc4","shared_citers":9},{"title":"HybridFlow: A Flexible and Efficient RLHF Framework","work_id":"7eb9c9f4-b322-4bba-8011-09ff8d6ad801","shared_citers":9},{"title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","work_id":"ea9e51ce-1e75-4182-92d8-4d25f70d2ee4","shared_citers":9}],"time_series":[{"n":1,"year":2024},{"n":17,"year":2025},{"n":74,"year":2026}],"dependency_candidates":[]},"authors":[]}}