ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
Learn Hard Problems During RL with Reference Guided Fine-Tuning
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary
years: 2026 (3)
verdicts: UNVERDICTED 3
roles: background 1
polarities: background 1
representative citing papers
SMEPO applies fine-grained semantic masking to expert guidance in RLVR, turning hard problems into fill-in-the-blank tasks while preserving structure, yielding accuracy gains of up to 3.2 points and 4.2x faster training.
citing papers explorer
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.