Tool-use agents suffer large accuracy drops from reward and transition perturbations but domain-randomized RL on static perturbations closes about 27% of the unseen transition gap while retaining most clean performance.
Mua-rl: Multi-turn user-interacting agent reinforcement learning for agentic tool use
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4verdicts
UNVERDICTED 4representative citing papers
SalesSim benchmarks MLLMs as retail user simulators, finds gaps in persona adherence and over-persuasion, and introduces UserGRPO RL to raise decision alignment by 13.8%.
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
citing papers explorer
-
When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents
Tool-use agents suffer large accuracy drops from reward and transition perturbations but domain-randomized RL on static perturbations closes about 27% of the unseen transition gap while retaining most clean performance.
-
SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators
SalesSim benchmarks MLLMs as retail user simulators, finds gaps in persona adherence and over-persuasion, and introduces UserGRPO RL to raise decision alignment by 13.8%.
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.