Efficient Reinforcement Finetuning via Adaptive Curriculum Learning

Jieyu Zhao; Linxin Song; Taiwei Shi; Tianyi Zhou; Yiyang Wu

arxiv: 2504.05520 · v4 · pith:OGPC33FFnew · submitted 2025-04-07 · 💻 cs.LG · cs.CL

Efficient Reinforcement Finetuning via Adaptive Curriculum Learning

Taiwei Shi , Yiyang Wu , Linxin Song , Tianyi Zhou , Jieyu Zhao This is my paper

classification 💻 cs.LG cs.CL

keywords adarftadaptivemodelcurriculumdifficultyfinetuninglearningreinforcement

0 comments

read the original abstract

Reinforcement finetuning (RFT) has shown great potential for enhancing the mathematical reasoning capabilities of large language models (LLMs), but it is often sample- and compute-inefficient, requiring extensive training. In this work, we introduce AdaRFT (Adaptive Curriculum Reinforcement Finetuning), a method that significantly improves the efficiency of RFT through adaptive curriculum learning. AdaRFT dynamically adjusts the difficulty of training problems based on the model's recent reward signals, ensuring that the model consistently trains on tasks that are challenging but solvable. This adaptive sampling strategy accelerates learning by maintaining an optimal difficulty range, avoiding wasted computation on problems that are too easy or too hard. AdaRFT requires only a lightweight extension to standard RFT algorithms like Proximal Policy Optimization (PPO), without modifying the reward function or model architecture. Experiments on competition-level math datasets demonstrate that AdaRFT improves convergence efficiency and reasoning performance. Given problem-level difficulty annotations, AdaRFT reduces RFT training time by up to 2 times across data distributions and model scales, offering a more scalable and effective RFT framework.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
cs.LG 2026-04 unverdicted novelty 7.0

TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
cs.LG 2026-05 unverdicted novelty 6.0

DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning
cs.CL 2026-03 unverdicted novelty 6.0

CODA uses rollout-based difficulty signals to drive two gates that penalize verbosity on easy instances and promote deliberation on hard ones, cutting token use over 60% on simple tasks while maintaining accuracy.
SCALER:Synthetic Scalable Adaptive Learning Environment for Reasoning
cs.AI 2026-01 unverdicted novelty 6.0

SCALER creates adaptive synthetic environments for RL-based LLM reasoning training that outperforms fixed-dataset baselines with more stable long-term progress.
LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance
cs.CL 2026-05 unverdicted novelty 5.0

LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.
D$^2$Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 5.0

D²Evo mines medium-difficulty anchors from the current model, trains a Questioner to generate matching questions, and jointly optimizes Solver and Questioner for progressive gains, outperforming baselines on math reas...
TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning
eess.SP 2026-04 unverdicted novelty 5.0

TimeRFT applies reinforcement learning with multi-faceted step-wise rewards and informative sample selection to improve generalization and accuracy in TSFM adaptation beyond supervised fine-tuning.
VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction
cs.LG 2026-02 unverdicted novelty 5.0

VI-CuRL stabilizes verifier-independent RL for LLM reasoning via confidence-guided curriculum that reduces action and problem variance, with a claimed proof of asymptotic unbiasedness and empirical gains over baselines.