hub Canonical reference

SimpleTIR: End-to-end reinforcementlearningformulti-turntool-integratedreasoning.arXivpreprint

Junhao Wei, Chao Zheng, Chengming Zhang, Xiao Cheng, Yufei Wu, et al · 2025 · arXiv 2509.02479

Canonical reference. 83% of citing Pith papers cite this work as background.

19 Pith papers citing it

Background 83% of classified citations

read on arXiv browse 19 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 baseline 1

citation-polarity summary

background 5 baseline 1

representative citing papers

The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interventions that enhance performance.

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.

Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

cs.CV · 2026-01-11 · unverdicted · novelty 7.0

VideoDR is a new benchmark for open-web video deep research that tests multimodal models on cross-frame visual anchor extraction, interactive retrieval, and multi-hop reasoning over joint video-web evidence.

SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models

cs.AI · 2026-01-07 · unverdicted · novelty 7.0

SCRIBE introduces skill-conditioned rewards with intermediate behavioral evaluation to reduce noise in training tool-augmented agents, raising AIME25 accuracy from 43.3% to 63.3% on a Qwen3-4B model.

Training Multi-Image Vision Agents via End2End Reinforcement Learning

cs.CV · 2025-12-05 · unverdicted · novelty 7.0

IMAgent trains a multi-image vision agent via pure end-to-end RL with visual reflection tools and a two-layer motion trajectory masking strategy, reaching SOTA on single- and multi-image benchmarks while revealing tool-use effects on attention.

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.

Harnessing LLM Agents with Skill Programs

cs.AI · 2026-05-18 · conditional · novelty 6.0

HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning benchmarks.

PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.

TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

cs.AI · 2026-05-10 · unverdicted · novelty 6.0

TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.

SOD: Step-wise On-policy Distillation for Small Language Model Agents

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.

T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.

When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

cs.CL · 2026-04-09 · unverdicted · novelty 6.0

ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.

AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.

Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.

E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning

cs.AI · 2026-04-10 · unverdicted · novelty 5.0

E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

cs.AI · 2026-01-29 · unverdicted · novelty 5.0

MemOCR renders structured memory as images with adaptive visual density to improve long-horizon reasoning under tight context budgets.

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

cs.CV · 2025-09-09 · unverdicted · novelty 5.0

Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.

The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

cs.AI · 2026-05-11 · unverdicted · novelty 4.0

Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.

citing papers explorer

Showing 19 of 19 citing papers.

The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits cs.LG · 2026-05-09 · unverdicted · none · ref 26
The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interventions that enhance performance.
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective cs.LG · 2026-05-08 · unverdicted · none · ref 46
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.
Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning cs.CV · 2026-01-11 · unverdicted · none · ref 15
VideoDR is a new benchmark for open-web video deep research that tests multimodal models on cross-frame visual anchor extraction, interactive retrieval, and multi-hop reasoning over joint video-web evidence.
SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models cs.AI · 2026-01-07 · unverdicted · none · ref 6
SCRIBE introduces skill-conditioned rewards with intermediate behavioral evaluation to reduce noise in training tool-augmented agents, raising AIME25 accuracy from 43.3% to 63.3% on a Qwen3-4B model.
Training Multi-Image Vision Agents via End2End Reinforcement Learning cs.CV · 2025-12-05 · unverdicted · none · ref 53
IMAgent trains a multi-image vision agent via pure end-to-end RL with visual reflection tools and a two-layer motion trajectory masking strategy, reaching SOTA on single- and multi-image benchmarks while revealing tool-use effects on attention.
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning cs.AI · 2026-05-21 · unverdicted · none · ref 103
SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.
Harnessing LLM Agents with Skill Programs cs.AI · 2026-05-18 · conditional · none · ref 37
HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning benchmarks.
PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning cs.CL · 2026-05-11 · unverdicted · none · ref 29
PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.
TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 13
TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
SOD: Step-wise On-policy Distillation for Small Language Model Agents cs.CL · 2026-05-08 · unverdicted · none · ref 15
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning cs.CL · 2026-05-07 · unverdicted · none · ref 40
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.
T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning cs.AI · 2026-05-04 · unverdicted · none · ref 23
T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.
When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning cs.CL · 2026-04-09 · unverdicted · none · ref 25
ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.
AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning cs.CV · 2026-04-09 · unverdicted · none · ref 36
AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization cs.CV · 2026-04-08 · unverdicted · none · ref 52
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning cs.AI · 2026-04-10 · unverdicted · none · ref 32
E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.
MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning cs.AI · 2026-01-29 · unverdicted · none · ref 28
MemOCR renders structured memory as images with adaptive visual density to improve long-horizon reasoning under tight context budgets.
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search cs.CV · 2025-09-09 · unverdicted · none · ref 42
Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents cs.AI · 2026-05-11 · unverdicted · none · ref 43
Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.

SimpleTIR: End-to-end reinforcementlearningformulti-turntool-integratedreasoning.arXivpreprint

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer