MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for Agentic Tool Use
6 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (6)
verdicts: UNVERDICTED (6)
roles: background (1)
polarities: background (1)
representative citing papers
SalesSim benchmarks MLLMs as retail user simulators, finds gaps in persona adherence and over-persuasion, and introduces UserGRPO RL to raise decision alignment by 13.8%.
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
PROVE trains LLMs on multi-step tool calls using 20 live MCP servers with 343 tools, state-grounded synthesis, and adaptive efficiency rewards, delivering gains of up to 10.2 points on BFCL Multi-Turn and similar gains on other benchmarks.
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
citing papers explorer
When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents
Tool-use agents suffer large accuracy drops from reward and transition perturbations, but domain-randomized RL on static perturbations closes about 27% of the unseen transition gap while retaining most clean performance.