hub

In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116

Api-bank: A comprehensive benchmark for tool-augmented llms · 2023 · arXiv 2502.11886

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

read on arXiv browse 11 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.

Gradient Extrapolation-Based Policy Optimization

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.

Cost-Aware Learning

cs.LG · 2026-04-30 · unverdicted · novelty 6.0

Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.

POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

POCA combines Pareto optimization with curriculum alignment to improve multi-reward reinforcement learning for visual text generation without relying on weighted sums.

HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

Can LLMs Learn to Reason Robustly under Noisy Supervision?

cs.LG · 2026-04-05 · conditional · novelty 6.0

Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.

ToolRL: Reward is All Tool Learning Needs

cs.LG · 2025-04-16 · conditional · novelty 6.0

A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

cs.AI · 2026-04-20 · unverdicted · novelty 5.0

Mixed-complexity procedural datasets provide up to 5x sample efficiency for RLVR on small models in low-data regimes, with low-to-high complexity generalization observed across counting, graph, and spatial tasks.

TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning

eess.SP · 2026-04-18 · unverdicted · novelty 5.0

TimeRFT applies reinforcement learning with multi-faceted step-wise rewards and informative sample selection to improve generalization and accuracy in TSFM adaptation beyond supervised fine-tuning.

From System 1 to System 2: A Survey of Reasoning Large Language Models

cs.AI · 2025-02-24 · accept · novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

citing papers explorer

Showing 11 of 11 citing papers.

DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards cs.LG · 2026-05-08 · unverdicted · none · ref 19
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation cs.LG · 2026-05-09 · unverdicted · none · ref 28
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
Gradient Extrapolation-Based Policy Optimization cs.LG · 2026-05-07 · unverdicted · none · ref 16
GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.
Cost-Aware Learning cs.LG · 2026-04-30 · unverdicted · none · ref 11
Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.
POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation cs.CV · 2026-04-27 · unverdicted · none · ref 18
POCA combines Pareto optimization with curriculum alignment to improve multi-reward reinforcement learning for visual text generation without relying on weighted sums.
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment cs.LG · 2026-04-20 · unverdicted · none · ref 96
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
Can LLMs Learn to Reason Robustly under Noisy Supervision? cs.LG · 2026-04-05 · conditional · none · ref 15
Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.
ToolRL: Reward is All Tool Learning Needs cs.LG · 2025-04-16 · conditional · none · ref 17
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes cs.AI · 2026-04-20 · unverdicted · none · ref 9
Mixed-complexity procedural datasets provide up to 5x sample efficiency for RLVR on small models in low-data regimes, with low-to-high complexity generalization observed across counting, graph, and spatial tasks.
TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning eess.SP · 2026-04-18 · unverdicted · none · ref 29
TimeRFT applies reinforcement learning with multi-faceted step-wise rewards and informative sample selection to improve generalization and accuracy in TSFM adaptation beyond supervised fine-tuning.
From System 1 to System 2: A Survey of Reasoning Large Language Models cs.AI · 2025-02-24 · accept · none · ref 276
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer