Hisr: Hindsight information modulated segmental process rewards for multi-turn agentic reinforcement learning. arXiv preprint arXiv:2603.18683, 2026.
2 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- TACO: Tool-Augmented Credit Optimization for Agentic Tool Use
  TACO combines Differential Answer-Probe Reward (DAPR) and Outcome-Gated Advantage Routing (OGAR) to assign credit to tool calls in agentic visual reasoning, producing accuracy gains on multimodal benchmarks.
- HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning
  HIPIF trains LLM agents end-to-end using subgoal-based hierarchical planning and information folding of completed histories, plus hierarchical reflection and process rewards, to handle long-horizon tasks without auxiliary models or expert trajectories.