arXiv preprint arXiv:2402.15506
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3
representative citing papers
TTExplore trains a 7B thinker via task-score RL to infer implicit rules at test time, raising agent success by 14-19 points on five embodied tasks.
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
citing papers explorer
- ReGRPO: Reflection-Augmented Policy Optimization for Tool-Using Agents
ReGRPO augments group-relative policy optimization with a reflective data engine that generates ErrorType-Evidence-FixPlan triplets from near-miss tool actions to improve recovery in multimodal agents.
- Test-Time Deep Thinking to Explore Implicit Rules
TTExplore trains a 7B thinker via task-score RL to infer implicit rules at test time, raising agent success by 14-19 points on five embodied tasks.