BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon ALFWorld tasks while raising effective sample use from 23.7% to 82%.
International Conference on Machine Learning , year=
2 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
citing papers explorer
-
Milestone-Guided Policy Learning for Long-Horizon Language Agents
BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon ALFWorld tasks while raising effective sample use from 23.7% to 82%.
-
Process Reinforcement through Implicit Rewards
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.