pith. machine review for the scientific record.

arxiv: 2510.05837 · v2 · submitted 2025-10-07 · 💻 cs.CL

Recognition: unknown

EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

Bo Han, Hinrich Schutze, Jing Bai, Kam-Fai Wong, Liang Chen, Qizhou Wang, Xueting Han

keywords: exploration, eepo, policy, dominant, exploitation, exploration-enhanced, gains, loop

Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize exploitation, leading to entropy collapse, diminished exploratory capacity, and ultimately limited performance gains. Although techniques that increase policy stochasticity can promote exploration, they frequently fail to escape dominant behavioral modes. This creates a self-reinforcing loop -- repeatedly sampling and rewarding dominant modes -- that further erodes exploration. We introduce Exploration-Enhanced Policy Optimization (EEPO), a framework that promotes exploration via two-stage rollouts with adaptive unlearning. In the first stage, the model generates half of the trajectories; it then undergoes a lightweight unlearning step to temporarily suppress these sampled responses, forcing the second stage to explore different regions of the output space. This sample-then-forget mechanism disrupts the self-reinforcing loop and promotes wider exploration during rollouts. Across five reasoning benchmarks, EEPO outperforms GRPO, achieving average relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.
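The sample-then-forget rollout described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the paper's implementation: the "policy" is a softmax over a handful of candidate responses, the hypothetical `unlearn` helper takes one gradient-descent step on the mean log-probability of the first-stage samples, and the step size `lr` is fixed rather than adaptive. The original logits are left untouched, which mirrors the idea that the suppression is temporary.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, n, rng):
    """Draw n response indices from the softmax policy."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=n)


def unlearn(logits, sampled, lr=2.0):
    """Hypothetical unlearning step: return a temporary copy of the logits
    after one gradient-descent step on the mean log-probability of `sampled`.
    For a softmax policy, d/d logit_j of mean log p(s) is freq_j - p_j."""
    probs = softmax(logits)
    grad = [-p for p in probs]
    for s in sampled:
        grad[s] += 1.0 / len(sampled)
    return [x - lr * g for x, g in zip(logits, grad)]


def two_stage_rollout(logits, n_rollouts, rng, lr=2.0):
    """Stage 1 samples half of the rollouts; stage 2 samples the rest from a
    temporarily suppressed copy of the policy."""
    half = n_rollouts // 2
    first = sample(logits, half, rng)
    temp_logits = unlearn(logits, first, lr)  # original logits are not modified
    second = sample(temp_logits, n_rollouts - half, rng)
    return first, second, temp_logits


if __name__ == "__main__":
    rng = random.Random(0)
    policy = [3.0, 0.0, 0.0, 0.0]  # response 0 is the dominant mode
    first, second, temp = two_stage_rollout(policy, 8, rng)
    print("stage 1 samples:", first)
    print("stage 2 samples:", second)
    print("dominant-mode probability before:", softmax(policy)[0])
    print("dominant-mode probability during stage 2:", softmax(temp)[0])
```

In this toy setting, the unlearning step lowers the likelihood of whatever stage 1 produced, so stage 2 draws from a distribution that places less mass on those responses. How the paper's method chooses the unlearning strength, and how it restores the policy before the update, is not specified in the abstract and is not modeled here.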

This paper has not been read by Pith yet.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...

  2. The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

    cs.LG 2026-04 unverdicted novelty 6.0

    MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density c...