{"work":{"id":"d4b4aee4-d20f-4572-886a-4ba9ea6c9b81","openalex_id":null,"doi":null,"arxiv_id":"2505.22617","raw_key":null,"title":"The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models","authors":null,"authors_text":"Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo","year":2025,"venue":"cs.LG","abstract":"This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. Such phenomenon is consistently observed across vast RL runs without entropy intervention, where the policy entropy dropped sharply at the early training stage, this diminished exploratory ability is always accompanied with the saturation of policy performance. In practice, we establish a transformation equation R=-a*e^H+b between entropy H and downstream performance R. This empirical law strongly indicates that, the policy performance is traded from policy entropy, thus bottlenecked by its exhaustion, and the ceiling is fully predictable H=0, R=-a+b. Our finding necessitates entropy management for continuous exploration toward scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that, the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to its advantage when using Policy Gradient-like algorithms. Empirical study shows that, the values of covariance term and entropy differences matched exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy would decrease monotonically. Through understanding the mechanism behind entropy dynamics, we motivate to control entropy by restricting the update of high-covariance tokens. Specifically, we propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which clip and apply KL penalty to tokens with high covariances respectively. Experiments show that these methods encourage exploration, thus helping policy escape entropy collapse and achieve better downstream performance.","external_url":"https://arxiv.org/abs/2505.22617","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-27T01:20:20.573078+00:00","pith_arxiv_id":"2505.22617","created_at":"2026-05-10T05:25:55.268387+00:00","updated_at":"2026-06-27T01:20:20.573078+00:00","title_quality_ok":true,"display_title":"The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models","render_title":"The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models"},"hub":{"state":{"work_id":"d4b4aee4-d20f-4572-886a-4ba9ea6c9b81","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":72,"external_cited_by_count":null,"distinct_field_count":6,"first_pith_cited_at":"2025-03-12T17:35:03+00:00","last_pith_cited_at":"2026-06-16T15:55:28+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-27T10:45:56.444838+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":23},{"context_role":"baseline","n":1},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":20},{"context_polarity":"support","n":2},{"context_polarity":"baseline","n":1},{"context_polarity":"unclear","n":1},{"context_polarity":"use_method","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T18:20:17.061999+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":28},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":22},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":22},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":19},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":17},{"title":"Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning","work_id":"e5e936f3-0cff-4732-b394-f607d7a63f5f","shared_citers":13},{"title":"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?","work_id":"d854765a-e664-41c0-8655-21c4bf2e0cc4","shared_citers":12},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":12},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":10},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":9},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":9},{"title":"Understanding R1-Zero-Like Training: A Critical Perspective","work_id":"ec354f3b-9484-4a0c-94c8-92d4d0260835","shared_citers":9},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":8},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":7},{"title":"Kimi K2: Open Agentic Intelligence","work_id":"7f18284c-12d3-4137-bea1-1da97e8cf3c1","shared_citers":7},{"title":"Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model","work_id":"763e0e44-40dd-4bdd-8414-21f8f9ce6d10","shared_citers":7},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":7},{"title":"Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models","work_id":"e7962a8e-fc74-4d96-9a1b-3c7897f6c60d","shared_citers":6},{"title":"Process Reinforcement through Implicit Rewards","work_id":"c31a2126-86f9-44f3-91f3-208d0fc1463a","shared_citers":6},{"title":"SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild","work_id":"94a68437-02e7-425a-91b2-5846ddcbd38c","shared_citers":6},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":5},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":5},{"title":"HybridFlow: A Flexible and Efficient RLHF Framework","work_id":"7eb9c9f4-b322-4bba-8011-09ff8d6ad801","shared_citers":5},{"title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs","work_id":"bff96ab1-bd6a-4585-be23-74fdb51969c7","shared_citers":5}],"time_series":[{"n":4,"year":2025},{"n":30,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T18:19:37.707955+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T18:19:56.153893+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models","claims":[{"claim_text":"This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. Such phenomenon is consistently observed across vast RL runs without entropy intervention, where the policy entropy dropped sharply at the early training stage, this diminished exploratory ability is always accompanied with the saturation of policy performance. In practice, we establish a transformation equation R=-a*e^H+b between entropy H and downstream performance R. This empirical law strongly indicates that, the policy performance is traded from policy entropy, thus b","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"[145] Yingqian Cui, Pengfei He, Jingying Zeng, Hui Liu, Xianfeng Tang, Zhenwei Dai, Yan Han, Chen Luo, Jing Huang, Zhen Li, et al. Stepwise perplexity-guided refinement for efficient chain-of-thought reasoning in large language models. arXiv preprint arXiv:2502.13260, 2025. [146] Yu Cui and Cong Zuo. Practical reasoning interruption attacks on reasoning large language models. arXiv preprint arXiv:2505.06643, 2025. [147] Yu Cui, Bryan Hooi, Yujun Cai, and Yiwei Wang. Process or result? manipulate","claim_type":"background","confidence":0.7,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (1 contexts).","role_counts":[{"n":1,"context_role":"background"}]},"error":null,"updated_at":"2026-05-14T18:19:46.859759+00:00"}},"summary":{"title":"The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models","claims":[{"claim_text":"This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. Such phenomenon is consistently observed across vast RL runs without entropy intervention, where the policy entropy dropped sharply at the early training stage, this diminished exploratory ability is always accompanied with the saturation of policy performance. In practice, we establish a transformation equation R=-a*e^H+b between entropy H and downstream performance R. This empirical law strongly indicates that, the policy performance is traded from policy entropy, thus b","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"[145] Yingqian Cui, Pengfei He, Jingying Zeng, Hui Liu, Xianfeng Tang, Zhenwei Dai, Yan Han, Chen Luo, Jing Huang, Zhen Li, et al. Stepwise perplexity-guided refinement for efficient chain-of-thought reasoning in large language models. arXiv preprint arXiv:2502.13260, 2025. [146] Yu Cui and Cong Zuo. Practical reasoning interruption attacks on reasoning large language models. arXiv preprint arXiv:2505.06643, 2025. [147] Yu Cui, Bryan Hooi, Yujun Cai, and Yiwei Wang. Process or result? manipulate","claim_type":"background","confidence":0.7,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (1 contexts).","role_counts":[{"n":1,"context_role":"background"}]},"graph":{"co_cited":[{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":28},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":22},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":22},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":19},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":17},{"title":"Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning","work_id":"e5e936f3-0cff-4732-b394-f607d7a63f5f","shared_citers":13},{"title":"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?","work_id":"d854765a-e664-41c0-8655-21c4bf2e0cc4","shared_citers":12},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":12},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":10},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":9},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":9},{"title":"Understanding R1-Zero-Like Training: A Critical Perspective","work_id":"ec354f3b-9484-4a0c-94c8-92d4d0260835","shared_citers":9},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":8},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":7},{"title":"Kimi K2: Open Agentic Intelligence","work_id":"7f18284c-12d3-4137-bea1-1da97e8cf3c1","shared_citers":7},{"title":"Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model","work_id":"763e0e44-40dd-4bdd-8414-21f8f9ce6d10","shared_citers":7},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":7},{"title":"Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models","work_id":"e7962a8e-fc74-4d96-9a1b-3c7897f6c60d","shared_citers":6},{"title":"Process Reinforcement through Implicit Rewards","work_id":"c31a2126-86f9-44f3-91f3-208d0fc1463a","shared_citers":6},{"title":"SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild","work_id":"94a68437-02e7-425a-91b2-5846ddcbd38c","shared_citers":6},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":5},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":5},{"title":"HybridFlow: A Flexible and Efficient RLHF Framework","work_id":"7eb9c9f4-b322-4bba-8011-09ff8d6ad801","shared_citers":5},{"title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs","work_id":"bff96ab1-bd6a-4585-be23-74fdb51969c7","shared_citers":5}],"time_series":[{"n":4,"year":2025},{"n":30,"year":2026}],"dependency_candidates":[]},"authors":[]}}