In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1417–1436, Abu Dhabi, United Arab Emirates
3 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (3)
verdicts: UNVERDICTED (3)
citing papers explorer
- Knowing When to Ask: Segment-Level Credit Assignment for LLM Tool Use
  CARL trains a critic for segment-level credit assignment from binary outcomes in LLM tool-use trajectories, yielding 6.7-9.7 point accuracy gains and 53% fewer calls on solvable questions across five benchmarks.
- AdaptR1: Reinforcement Learning Based Adaptive Interleaved Thinking in Multi-hop Question Answering
  AdaptR1 uses fully RL-based training with a quality-gated efficiency reward for step-wise adaptive reasoning in multi-hop QA, reducing think tokens by 69.71% on average and 90.35% on HotpotQA with comparable or better performance.
- Integrating Chain-of-Thought into Generative Retrieval: A Preliminary Study
  ThinkGR interleaves chain-of-thought with docid generation using hybrid decoding and two-phase training to achieve state-of-the-art results on multi-hop retrieval benchmarks.