arXiv preprint arXiv:2602.03025, 2026.

5 representative citing papers
- OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents
  OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.
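The formulation in that summary maps onto the standard LinUCB recipe: per-action ridge-regression reward models over a frozen feature vector, plus an exploration bonus from the inverse covariance. A minimal sketch under assumed details (the feature dimension, action count, and update schedule are illustrative, not taken from the paper):

```python
import numpy as np

def linucb_select(A_inv, b, x, alpha=1.0):
    """Pick the action with the highest UCB score for context x.

    A_inv[a], b[a] hold per-action ridge-regression statistics:
    theta_a = A_inv[a] @ b[a] is the reward estimate, and the bonus
    alpha * sqrt(x^T A_inv[a] x) widens for rarely tried actions.
    """
    scores = []
    for a in range(len(b)):
        theta = A_inv[a] @ b[a]
        bonus = alpha * np.sqrt(x @ A_inv[a] @ x)
        scores.append(theta @ x + bonus)
    return int(np.argmax(scores))

def linucb_update(A_inv, b, a, x, reward):
    """Sherman-Morrison rank-1 update of A_inv[a] after observing reward."""
    Ax = A_inv[a] @ x
    A_inv[a] -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
    b[a] += reward * x

# toy run: 3 actions, 4-dim "frozen hidden state" features
d, n_actions = 4, 3
A_inv = [np.eye(d) for _ in range(n_actions)]
b = [np.zeros(d) for _ in range(n_actions)]
x = np.ones(d) / 2
a = linucb_select(A_inv, b, x)
linucb_update(A_inv, b, a, x, reward=1.0)
```

Because only the small per-action linear heads are updated, this kind of adaptation can run at inference time without touching the frozen LLM.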
- Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning
  InfoTree casts intermediate state selection in tree search as monotone submodular maximization under fixed rollout budgets, yielding closed-form UUCB terms and lifting mixed-outcome ratios while outperforming flat GRPO and prior tree variants on nine benchmarks.
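The submodular view invoked there corresponds to the classic greedy template: under a cardinality (rollout) budget, repeatedly pick the candidate state with the largest marginal gain, which carries the (1 - 1/e) approximation guarantee for monotone submodular objectives. A generic sketch with a toy coverage objective (the gain function below is illustrative, not InfoTree's information measure):

```python
def greedy_submodular(candidates, gain, budget):
    """Greedy maximization of a monotone submodular set function
    under a cardinality (rollout) budget."""
    chosen = set()
    for _ in range(budget):
        best, best_gain = None, 0.0
        for c in candidates:
            if c in chosen:
                continue
            g = gain(chosen, c)  # marginal gain of adding c
            if g > best_gain:
                best, best_gain = c, g
        if best is None:  # no candidate adds anything
            break
        chosen.add(best)
    return chosen

# toy objective: each candidate state "covers" a set of outcomes;
# marginal gain = number of newly covered outcomes (submodular)
covers = {"s1": {1, 2}, "s2": {2, 3}, "s3": {4}, "s4": {1}}

def coverage_gain(chosen, c):
    covered = set().union(*(covers[s] for s in chosen)) if chosen else set()
    return len(covers[c] - covered)

picked = greedy_submodular(covers, coverage_gain, budget=2)
```

With a budget of 2 the greedy rule picks "s1" (2 new outcomes) and then "s2" (1 new outcome), skipping "s4", whose outcomes are already covered.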
- SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving
  SCORP delivers 10-28% gains on safety metrics and 2-7% on efficiency metrics on WOMD by combining dual-path scene conditioning in diffusion planning with variance-gated group-relative policy optimization for closed-loop stability.
- Skill-R1: Agent Skill Evolution via Reinforcement Learning
  Skill-R1 applies bi-level group-relative policy optimization to evolve skills recurrently from verified outcomes, yielding gains over baselines on multi-step tasks.
- Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works
  Group-mean centering in binary-reward GRPO produces gradient starvation; the fixed-sign advantage A = 2r - 1 raises GSM8K accuracy from 28.4% to 73.8% at group size 4.
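The failure mode is easy to reproduce numerically: group-mean centering zeroes out every advantage in a group whose binary rewards are uniform (all 0 or all 1), so such groups contribute no gradient, while A = 2r - 1 always carries signal. A minimal sketch (the group size and reward patterns are illustrative):

```python
def centered_advantages(rewards):
    """GRPO-style group-mean centering: A_i = r_i - mean(r)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def fixed_sign_advantages(rewards):
    """The simple fix for binary rewards: A_i = 2 r_i - 1."""
    return [2 * r - 1 for r in rewards]

uniform_fail = [0, 0, 0, 0]   # group of 4 rollouts, all wrong
uniform_pass = [1, 1, 1, 1]   # all correct
mixed        = [1, 0, 1, 0]

# Centering zeroes out uniform groups -> no gradient ("starvation").
assert centered_advantages(uniform_fail) == [0.0] * 4
assert centered_advantages(uniform_pass) == [0.0] * 4

# The fixed-sign advantage keeps a nonzero signal in every case.
assert fixed_sign_advantages(uniform_fail) == [-1] * 4
assert fixed_sign_advantages(uniform_pass) == [1] * 4
assert fixed_sign_advantages(mixed) == [1, -1, 1, -1]
```

At small group sizes uniform reward groups are common, which is consistent with the large accuracy gap the summary reports at group size 4.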