CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning

Guanjun Jiang; Guohao Sun; Kevin Zhang; Rujun Guo; Shaoyu Liu; Shijie Zhang; Shiyu Liu; Wangxiao Zhao; Xiang Guo; Zheng Xiao

arXiv: 2509.25004 · v2 · pith:HGXRU3A3new · submitted 2025-09-29 · cs.AI

keywords: reasoning, CLPO, curriculum, learning, problems, policy, accuracy, become

Online reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning abilities of large language models, but most methods still optimize reasoning trajectories over a static problem set, wasting rollout budget on solved or overly difficult problems. We propose CLPO (Curriculum Learning meets Policy Optimization), a self-evolving curriculum framework that uses on-policy rollout accuracy to identify solved, medium-difficulty, and hard problems, then restructures selected tasks according to the model's current capability. Hard problems are simplified to become learnable, while medium-difficulty problems are diversified to provide useful training variation. This allows the learning curriculum to co-evolve with the policy rather than remaining fixed as the model's capability boundary shifts. Rather than treating these rewrites as static data augmentation, CLPO optimizes restructuring trajectories with credit assigned by the downstream accuracy gain of the rewritten problem, requiring no additional human annotations beyond the original verifiable answers. Experiments across mathematical reasoning and out-of-domain general reasoning benchmarks show that CLPO substantially outperforms GRPO and DAPO on Qwen3-8B by 10.21 and 7.75 average points, respectively. Ablation studies on math and code domains further show that both the restructuring mode and the rewriting loss contribute to the final gains, demonstrating that CLPO provides a scalable and robust pathway for eliciting stronger reasoning capabilities through a self-evolving curriculum.
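The difficulty bucketing and rewrite credit described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the thresholds `hard_max` and `solved_min`, the choice to skip solved problems, and the reading of "downstream accuracy gain" as the rewritten problem's rollout accuracy minus the original's. The abstract does not specify any of these details, so this should not be read as the paper's implementation.

```python
from typing import Sequence


def rollout_accuracy(correct: Sequence[bool]) -> float:
    """Fraction of on-policy rollouts whose final answer was verified correct."""
    if not correct:
        raise ValueError("need at least one rollout")
    return sum(correct) / len(correct)


def bucket(acc: float, hard_max: float = 0.25, solved_min: float = 1.0) -> str:
    """Assign a difficulty bucket from rollout accuracy.

    hard_max and solved_min are hypothetical thresholds chosen for this
    sketch; the abstract does not state the values CLPO uses.
    """
    if acc >= solved_min:
        return "solved"
    if acc <= hard_max:
        return "hard"
    return "medium"


def restructuring_action(bucket_name: str) -> str:
    """Map a bucket to the restructuring mode named in the abstract.

    "simplify" and "diversify" come from the abstract; "skip" for solved
    problems is an assumption of this sketch.
    """
    actions = {"solved": "skip", "medium": "diversify", "hard": "simplify"}
    return actions[bucket_name]


def rewrite_reward(original_acc: float, rewritten_acc: float) -> float:
    """One plausible reading of "downstream accuracy gain": the change in
    rollout accuracy after the problem is rewritten (assumed, not confirmed)."""
    return rewritten_acc - original_acc
```

Under these assumptions, a problem with 1 correct rollout out of 8 falls in the hard bucket and is sent for simplification; if the simplified version is then solved in 4 of 8 rollouts, the rewriting trajectory would receive a positive credit under this reading of the reward.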

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
cs.CL 2026-06 unverdicted novelty 7.0

ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite w...

ENVS: Environment-Native Verified Search for Long-Horizon GUI Agents
cs.AI 2026-06 unverdicted novelty 6.0

ENVS generates verified supervision via environment-native search in OSWorld VMs to train GUI agents, reaching 30.3 pass@8 on 300 tasks while using less compute than ARPO baselines and introducing OSWorld-Noisy for in...