pith. machine review for the scientific record.

arxiv: 2505.10978 · v3 · submitted 2025-05-16 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

Group-in-Group Policy Optimization for LLM Agent Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 09:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Group-in-Group Policy Optimization · LLM agent training · reinforcement learning · credit assignment · multi-turn agents · policy optimization · relative advantage estimation · ALFWorld · WebShop

The pith

GiGPO assigns per-step credit in multi-turn LLM agent training by grouping actions from repeated environment states across trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to solve the credit assignment problem that arises when training LLM agents on long, interactive tasks with sparse or delayed rewards. It extends group-based reinforcement learning by adding a second, finer level of grouping: after computing advantages from whole successful versus unsuccessful trajectories, it retroactively clusters actions that begin from the same observed state and compares them locally. This hierarchical structure supplies both global trajectory quality signals and local step effectiveness signals while remaining critic-free and memory-efficient. A sympathetic reader would care because it promises stronger learning on realistic agent benchmarks without the usual increases in compute or model complexity that have limited prior RL methods for agents.

Core claim

GiGPO introduces a two-level relative advantage estimator in which episode-level macro advantages are calculated from groups of complete trajectories and step-level micro advantages are calculated by identifying repeated anchor states across trajectories and comparing the actions taken from each shared state.
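A minimal sketch of this two-level estimator, reconstructed from the abstract (the normalization details, zero-variance guard, and zero advantage for singleton groups are assumptions; the paper's exact formulas may differ):

```python
import statistics

def macro_advantages(returns):
    """Episode-level (macro) relative advantages: each trajectory's
    return is normalized against its rollout group, as in GRPO."""
    mu = statistics.mean(returns)
    sigma = statistics.pstdev(returns) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in returns]

def micro_advantages(step_records):
    """Step-level (micro) relative advantages: actions are grouped by a
    shared anchor-state key and normalized within each group.
    step_records: list of (state_key, action_id, step_return)."""
    groups = {}
    for state_key, action_id, ret in step_records:
        groups.setdefault(state_key, []).append((action_id, ret))
    advantages = {}
    for state_key, members in groups.items():
        rets = [r for _, r in members]
        mu = statistics.mean(rets)
        sigma = statistics.pstdev(rets) or 1.0
        for action_id, r in members:
            # Singleton groups allow no local comparison, so we assume
            # they receive zero micro advantage.
            adv = 0.0 if len(members) == 1 else (r - mu) / sigma
            advantages[(state_key, action_id)] = adv
    return advantages
```

The macro pass sees only whole-trajectory returns; the micro pass adds a within-state comparison wherever a state key recurs.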

What carries the argument

Anchor state grouping mechanism that retroactively forms step-level groups from identical environment states observed in different trajectories to compute micro relative advantages.
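The grouping step might be sketched as follows, using exact observation equality as the anchor-state key (an illustrative assumption; the paper's matching criterion may be more involved):

```python
from collections import defaultdict

def build_anchor_groups(trajectories):
    """Retroactively cluster steps from different trajectories that begin
    from the same observed environment state. Each trajectory is a list of
    (observation, action) pairs; the observation string serves as the
    anchor-state key here, purely for illustration."""
    groups = defaultdict(list)
    for traj_id, steps in enumerate(trajectories):
        for t, (obs, action) in enumerate(steps):
            groups[obs].append((traj_id, t, action))
    # Only groups with more than one member permit a local comparison,
    # i.e., a non-trivial micro relative advantage.
    return {obs: members for obs, members in groups.items() if len(members) > 1}
```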

If this is right

  • Achieves performance gains exceeding 12 percent on ALFWorld and 9 percent on WebShop relative to the GRPO baseline.
  • Reaches 42.1 percent accuracy with the 3B model and 47.2 percent with the 7B model on search-augmented QA tasks.
  • Maintains identical GPU memory footprint and LLM rollout procedure with negligible extra wall-clock time.
  • Supplies fine-grained per-step credit signals while retaining critic-free training and stable convergence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same-state grouping idea could be applied to other sparse-reward sequential decision domains that contain repeatable observations, such as certain games or simulated robotic tasks.
  • Because the method adds no auxiliary networks or extra rollouts, it may lower the practical barrier to scaling RL-based agent training to larger base LLMs.
  • If state repetition is low in a given domain, the micro-advantage component would contribute little, suggesting the approach works best in environments with natural state revisits.

Load-bearing premise

Repeated environment states can be reliably detected across trajectories and the actions taken from them yield unbiased estimates of relative quality.

What would settle it

On a controlled benchmark where the same states recur frequently but action quality is known in advance, check whether the micro-advantage estimates from GiGPO improve final policy performance over an episode-level-only baseline.

read the original abstract

Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to multi-turn LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) At the episode-level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step-level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on challenging agent benchmarks, including ALFWorld and WebShop, as well as tool-integrated reasoning on search-augmented QA tasks, using Qwen2.5-1.5B/3B/7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals, achieves performance gains of > 12% on ALFWorld and > 9% on WebShop over GRPO, and obtains superior performance on QA tasks (42.1% on 3B and 47.2% on 7B): all while maintaining the same GPU memory overhead, identical LLM rollout, and incurring little to no additional time cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Group-in-Group Policy Optimization (GiGPO), a hierarchical extension of group-based RL for multi-turn LLM agent training. It computes macro relative advantages over groups of full trajectories at the episode level and introduces an anchor-state grouping mechanism to form step-level groups by matching repeated environment states across trajectories, enabling micro relative advantages for finer credit assignment. The approach is critic-free and claims to preserve low memory and rollout costs. Experiments on ALFWorld, WebShop, and search-augmented QA tasks with Qwen2.5 models report gains of >12% (ALFWorld) and >9% (WebShop) over GRPO, along with QA accuracies of 42.1% (3B) and 47.2% (7B).

Significance. If the anchor-state grouping reliably produces multi-action groups and the reported gains are reproducible, GiGPO would provide a practical, low-overhead route to improved step-level credit assignment in long-horizon agent tasks without auxiliary critics or extra rollouts. The explicit preservation of identical LLM rollout and GPU memory footprint is a concrete engineering strength that distinguishes it from many hierarchical RL variants.

major comments (2)
  1. [Experiments] Experimental section: the paper reports concrete gains (>12% ALFWorld, >9% WebShop) and QA accuracies but supplies no information on the number of independent runs, standard deviations, statistical significance tests, or exact baseline re-implementations. Without these, it is impossible to determine whether the improvements are attributable to the micro-advantage component or to other implementation choices.
  2. [Method] Method (anchor state grouping): the central claim of fine-grained per-step credit assignment rests on the assumption that repeated environment states occur frequently enough to form groups of size >1. The manuscript does not report the empirical distribution of group sizes or the fraction of steps that actually receive a non-trivial micro relative advantage. In partially observable, long-horizon environments such as ALFWorld and WebShop, rapid trajectory divergence makes exact state matches rare; if most groups have size 1, the hierarchical mechanism reduces to standard GRPO and the claimed granularity is not realized.
minor comments (2)
  1. [Abstract] The abstract states 'identical LLM rollout' without clarifying whether this refers to the same number of trajectories, the same sampling temperature, or both; a brief parenthetical would remove ambiguity.
  2. [Method] Notation for macro and micro advantages is introduced without an explicit equation linking them to the final policy gradient; adding a single combined update equation would improve clarity.
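One plausible form of the combined update equation the second minor comment requests, assuming the macro and micro advantages enter additively with a weighting coefficient $\omega$ (both the additive combination and $\omega$ are assumptions here, not the paper's stated formula):

```latex
A_t = \hat{A}^{\text{macro}}(\tau) + \omega\,\hat{A}^{\text{micro}}(s_t, a_t),
\qquad
\mathcal{L}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,A_t,\;
\operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right) A_t\right)\right],
```

with $r_t(\theta)$ the importance ratio, as in the standard PPO/GRPO clipped surrogate.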

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below, committing to revisions that strengthen the experimental reporting and provide empirical validation of the anchor-state mechanism.

read point-by-point responses
  1. Referee: [Experiments] Experimental section: the paper reports concrete gains (>12% ALFWorld, >9% WebShop) and QA accuracies but supplies no information on the number of independent runs, standard deviations, statistical significance tests, or exact baseline re-implementations. Without these, it is impossible to determine whether the improvements are attributable to the micro-advantage component or to other implementation choices.

    Authors: We agree that the experimental section would be strengthened by explicit reproducibility details. In the revised manuscript we will state that all main results were obtained from three independent runs with distinct random seeds. Standard deviations will be added to all tables and figures. Baseline re-implementations follow the official GRPO repository with only the minimal changes required to support multi-turn agent rollouts and our state-matching logic; these differences will be documented in the appendix. We will also report paired t-test p-values comparing GiGPO against GRPO on each benchmark to establish statistical significance of the observed gains. revision: yes

  2. Referee: [Method] Method (anchor state grouping): the central claim of fine-grained per-step credit assignment rests on the assumption that repeated environment states occur frequently enough to form groups of size >1. The manuscript does not report the empirical distribution of group sizes or the fraction of steps that actually receive a non-trivial micro relative advantage. In partially observable, long-horizon environments such as ALFWorld and WebShop, rapid trajectory divergence makes exact state matches rare; if most groups have size 1, the hierarchical mechanism reduces to standard GRPO and the claimed granularity is not realized.

    Authors: We appreciate the referee’s emphasis on verifying that the anchor-state grouping actually delivers non-trivial micro-advantages. While the current manuscript does not include these statistics, our internal analysis confirms that repeated observable states (e.g., identical room layouts in ALFWorld or product-page states in WebShop) occur sufficiently often to produce groups of size greater than one, especially for recurring sub-tasks. To address the concern directly, the revised paper will add a dedicated analysis subsection (or appendix) containing (i) histograms of group-size distributions across all evaluated tasks and (ii) the exact fraction of steps that receive a micro relative advantage (i.e., belong to groups of size >1). These figures will demonstrate that the hierarchical component provides meaningful step-level credit assignment beyond standard GRPO, even under partial observability. revision: yes
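The promised group-size analysis could be computed along these lines (exact observation matching as the state key is again an illustrative assumption):

```python
from collections import Counter, defaultdict

def group_size_stats(trajectories):
    """Sketch of the analysis the rebuttal commits to: a histogram of
    anchor-group sizes and the fraction of steps receiving a non-trivial
    micro advantage (i.e., belonging to a group of size > 1).
    Each trajectory is a list of (observation, action) pairs."""
    counts = defaultdict(int)
    total_steps = 0
    for steps in trajectories:
        for obs, _action in steps:
            counts[obs] += 1
            total_steps += 1
    hist = Counter(counts.values())  # group size -> number of groups
    nontrivial = sum(size * n for size, n in hist.items() if size > 1)
    frac = nontrivial / total_steps if total_steps else 0.0
    return hist, frac
```

If `frac` is near zero in a given environment, the micro component degenerates and the method effectively reduces to episode-level GRPO, which is precisely the referee's concern.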

Circularity Check

0 steps flagged

GiGPO defines a hierarchical grouping mechanism for relative advantages without reducing to self-referential fits or self-citation chains.

full rationale

The paper presents GiGPO as a structural extension of group-based RL, computing macro advantages from complete trajectory groups and micro advantages from retroactively identified anchor states across trajectories. These quantities are derived directly from observed rewards within the constructed groups rather than being fitted to target performance metrics or defined in terms of the final policy outputs. No equations or claims reduce the reported per-step credit signals or benchmark gains to inputs by construction, and the derivation relies on the external environment dynamics and rollout data rather than internal self-reference or author-specific uniqueness theorems. The algorithm remains self-contained against standard RL benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard RL assumptions that relative advantages estimated from grouped trajectories and states are unbiased estimators of policy improvement; no new physical entities or ad-hoc constants are introduced beyond the grouping procedure itself.

axioms (1)
  • domain assumption: Relative advantage computed from groups of trajectories or states is a valid signal for policy gradient updates.
    Invoked when the paper states that macro and micro relative advantages enable fine-grained credit assignment without auxiliary models.
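For reference, the group-relative estimator this axiom concerns has, in standard GRPO, the form (reconstructed from the GRPO literature; GiGPO's exact normalization may differ):

```latex
\hat{A}_i = \frac{R_i - \operatorname{mean}\!\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\!\left(\{R_j\}_{j=1}^{G}\right)},
```

where $R_j$ is the return of the $j$-th member of a group of size $G$, applied at the trajectory level for macro advantages and within anchor-state groups for micro advantages.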

pith-pipeline@v0.9.0 · 5638 in / 1347 out tokens · 78993 ms · 2026-05-11T09:09:45.069014+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts."

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 46 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

    cs.AI 2026-05 conditional novelty 7.0

    ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.

  2. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  3. Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

    cs.LG 2026-05 unverdicted novelty 7.0

    Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...

  4. AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.

  5. AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.

  6. The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

    cs.CL 2026-05 unverdicted novelty 7.0

    An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

  7. ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

  8. TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

    cs.LG 2026-04 unverdicted novelty 7.0

    TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.

  9. Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

  10. RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

    cs.AI 2026-04 unverdicted novelty 7.0

    RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

  11. PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent

    cs.AI 2026-04 unverdicted novelty 7.0

    PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.

  12. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  13. Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

    cs.LG 2026-05 conditional novelty 6.0

    ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 per...

  14. Hölder Policy Optimisation

    cs.LG 2026-05 unverdicted novelty 6.0

    HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.

  15. Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

    cs.LG 2026-05 unverdicted novelty 6.0

    Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetr...

  16. Verifiable Process Rewards for Agentic Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    Verifiable Process Rewards (VPR) converts symbolic oracles into dense turn-level supervision for reinforcement learning in agentic reasoning, outperforming outcome-only rewards and transferring to general benchmarks.

  17. Beyond Thinking: Imagining in 360° for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  18. Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

    cs.AI 2026-05 unverdicted novelty 6.0

    Behavior Cue Reasoning trains LLMs to emit special tokens before behaviors, enabling monitors to prune up to 50% of wasted tokens and recover safe actions from 80% of unsafe traces, more than doubling success rates wi...

  19. ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

    cs.DC 2026-05 unverdicted novelty 6.0

    ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.

  20. A²TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

    cs.CL 2026-05 unverdicted novelty 6.0

    A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges p...

  21. From History to State: Constant-Context Skill Learning for LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebS...

  22. Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

    cs.AI 2026-05 unverdicted novelty 6.0

    A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 ...

  23. Every Step Counts: Step-Level Credit Assignment for Tool-Integrated Text-to-SQL

    cs.CL 2026-05 unverdicted novelty 6.0

    FineStep adds step-level process rewards and credit assignment to tool-augmented Text-to-SQL, achieving 3.25% higher execution accuracy than GRPO on BIRD while cutting redundant tool calls.

  24. DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

    cs.LG 2026-05 unverdicted novelty 6.0

    DGPO is a critic-free RL framework that uses bounded Hellinger distance and entropy-gated advantage redistribution to enable fine-grained token-level credit assignment in long CoT generations for LLM alignment, report...

  25. DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

    cs.LG 2026-05 unverdicted novelty 6.0

    DGPO reinterprets distribution deviation as a guiding signal in a critic-free policy optimization framework to enable fine-grained credit assignment for LLM chain-of-thought reasoning.

  26. T²PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.

  27. ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.

  28. Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

  29. DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

    cs.CL 2026-04 unverdicted novelty 6.0

    DPEPO enables LLM agents to perform diverse parallel exploration with hierarchical rewards, achieving SOTA success rates on ALFWorld and ScienceWorld while keeping efficiency comparable to sequential baselines.

  30. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  31. HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation

    cs.AI 2026-04 unverdicted novelty 6.0

    HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.

  32. Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    T-STAR consolidates multi-turn trajectories into a Cognitive Tree for variance-reduced step-level advantages and surgical policy optimization via thought grafting at critical points.

  33. Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    STEP-HRL enables step-level learning in LLM agents via hierarchical task structure and local progress modules, outperforming baselines on ScienceWorld and ALFWorld while cutting token usage.

  34. Gen-Searcher: Reinforcing Agentic Search for Image Generation

    cs.CV 2026-03 unverdicted novelty 6.0

    Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.

  35. StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

    cs.CL 2026-05 unverdicted novelty 5.0

    StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.

  36. On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

    cs.AI 2026-05 unverdicted novelty 5.0

    Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

  37. GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

    cs.AI 2026-04 unverdicted novelty 5.0

    The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

  38. From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.

  39. Environmental Understanding Vision-Language Model for Embodied Agent

    cs.CV 2026-04 unverdicted novelty 5.0

    EUEA fine-tunes VLMs on object perception, task planning, action understanding and goal recognition, with recovery and GRPO, to raise ALFRED success rates by 11.89% over behavior cloning.

  40. Seeing Isn't Believing: Mitigating Belief Inertia via Active Intervention in Embodied Agents

    cs.CL 2026-04 unverdicted novelty 5.0

    The Estimate-Verify-Update (EVU) mechanism reduces belief inertia in embodied agents and raises task success rates on three benchmarks.

  41. From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    A survey of credit assignment techniques in LLM reinforcement learning that distinguishes maturing methods for reasoning from new approaches needed for agentic settings and provides supporting resources.

  42. StaRPO: Stability-Augmented Reinforcement Policy Optimization

    cs.AI 2026-04 unverdicted novelty 5.0

    StaRPO improves LLM reasoning by adding autocorrelation function and path efficiency stability metrics to RL policy optimization, yielding higher accuracy and fewer logic errors on reasoning benchmarks.

  43. SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    SEARL uses a tool graph memory that integrates planning and execution to densify rewards and improve generalization in self-evolving agents on knowledge and math tasks.

  44. RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

    cs.RO 2026-04 unverdicted novelty 5.0

    RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.

  45. StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 4.0

    StepPO argues that LLM agents should optimize at the step level rather than token level to better handle delayed rewards and long contexts in agentic RL.

  46. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

Reference graph

Works this paper leans on

118 extracted references · 118 canonical work pages · cited by 42 Pith papers · 22 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  3. [3]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  4. [4]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

  5. [5]

    ALFWorld: Aligning text and embodied environments for interactive learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, 2021

  6. [6]

    Embodied agent interface: Benchmarking LLMs for embodied decision making

    Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Erran Li Li, Ruohan Zhang, et al. Embodied agent interface: Benchmarking LLMs for embodied decision making. Advances in Neural Information Processing Systems, 37:100428–100534, 2024

  7. [7]

    Multimodal web navigation with instruction-finetuned foundation models

    Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, and Izzeddin Gur. Multimodal web navigation with instruction-finetuned foundation models. In The Twelfth International Conference on Learning Representations, 2024

  8. [8]

    GPT-4V(ision) is a Generalist Web Agent, if Grounded

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. GPT-4V(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024

  9. [9]

    Navigating the digital world as humans do: Universal visual grounding for GUI agents

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In The Thirteenth International Conference on Learning Representations, 2025

  10. [10]

    Towards efficient online tuning of VLM agents via counterfactual soft reinforcement learning

    Lang Feng, Weihao Tan, Zhiyi Lyu, Longtao Zheng, Haiyang Xu, Ming Yan, Fei Huang, and Bo An. Towards efficient online tuning of VLM agents via counterfactual soft reinforcement learning. In International Conference on Machine Learning, 2025

  11. [11]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024

  12. [12]

    Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

    Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-Agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. Advances in Neural Information Processing Systems, 37:2686–2710, 2024

  13. [13]

    Reinforcement Learning: An Introduction

    Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT press, 2018

  14. [14]

    Introducing OpenAI o1, 2024

    OpenAI. Introducing OpenAI o1, 2024

  15. [15]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  16. [16]

    Buy 4 REINFORCE Samples, Get a Baseline for Free!

    Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free! In ICLR 2019 Workshop, 2019

  17. [17]

    Back to basics: Revisiting reinforce style optimization for learning from human feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267, 2024

  18. [18]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  19. [19]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025

  20. [20]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  21. [21]

    SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

    Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I Wang. SWE-RL: Advancing LLM reasoning via reinforcement learning on open software evolution. arXiv preprint arXiv:2502.18449, 2025

  22. [22]

    WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022

  23. [23]

    CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges

    Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13643–13658, 2024

  24. [24]

    You only look at screens: Multimodal chain-of-action agents

    Zhuosheng Zhang and Aston Zhang. You only look at screens: Multimodal chain-of-action agents. In Findings of the Association for Computational Linguistics ACL 2024, pages 3132–3149, 2024

  25. [25]

    CogAgent: A visual language model for GUI agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. CogAgent: A visual language model for GUI agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024

  26. [26]

    A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

    Izzeddin Gur, Hiroki Furuta, Austin V. Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. In The Twelfth International Conference on Learning Representations, 2024

  27. [27]

    The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use

    Siyuan Hu, Mingyu Ouyang, Difei Gao, and Mike Zheng Shou. The dawn of GUI agent: A preliminary case study with Claude 3.5 Computer Use. arXiv preprint arXiv:2411.10323, 2024

  28. [28]

    RT-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

  29. [29]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023

  30. [30]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024

  31. [31]

    Cradle: Empowering foundation agents towards general computer control

    Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Gang Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, et al. Cradle: Empowering foundation agents towards general computer control. In NeurIPS 2024 Workshop on Open-World Agents, 2024

  32. [32]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023

  33. [33]

    OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In The Thirty-eighth Conference on N...

  34. [34]

    UFO: A UI-Focused Agent for Windows OS Interaction

    Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, et al. UFO: A UI-focused agent for Windows OS interaction. arXiv preprint arXiv:2402.07939, 2024

  35. [35]

    Human-Level Control Through Deep Reinforcement Learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015

  36. [36]

    Language Understanding for Text-Based Games Using Deep Reinforcement Learning

    Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. Language understanding for text-based games using deep reinforcement learning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1–11, 2015

  37. [37]

    True Knowledge Comes from Practice: Aligning Large Language Models with Embodied Environments via Reinforcement Learning

    Weihao Tan, Wentao Zhang, Shanqi Liu, Longtao Zheng, Xinrun Wang, and Bo An. True knowledge comes from practice: Aligning large language models with embodied environments via reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024

  38. [38]

    Reinforcing LLM agents via policy optimization with action decomposition

    Muning Wen, Ziyu Wan, Jun Wang, Weinan Zhang, and Ying Wen. Reinforcing LLM agents via policy optimization with action decomposition. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  39. [39]

    Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

    Simon Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Peter Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, et al. Fine-tuning large vision-language models as decision-making agents via reinforcement learning. Advances in Neural Information Processing Systems, 37:110935–110971, 2024

  40. [40]

    DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning

    Hao Bai, Yifei Zhou, Jiayi Pan, Mert Cemri, Alane Suhr, Sergey Levine, and Aviral Kumar. DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems, 37:12461–12495, 2024

  41. [41]

    DistRL: An asynchronous distributed reinforcement learning framework for on-device control agent

    Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye HAO, Jun Wang, and Kun Shao. DistRL: An asynchronous distributed reinforcement learning framework for on-device control agent. In The Thirteenth International Conference on Learning Representations, 2025

  42. [42]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  43. [43]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019

  44. [44]

    Android in the Wild: A Large-Scale Dataset for Android Device Control

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems, 36, 2024

  45. [45]

    OpenAI Gym

    G Brockman. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016

  46. [46]

    ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL

    Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, and Aviral Kumar. ArCHer: Training language model agents via hierarchical multi-turn RL. In International Conference on Machine Learning, pages 62178–62209. PMLR, 2024

  47. [47]

    Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

    Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent Q: Advanced reasoning and learning for autonomous AI agents. arXiv preprint arXiv:2408.07199, 2024

  48. [48]

    Mastering the Game of Go Without Human Knowledge

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017

  49. [49]

    Reinforcement Learning for Long-Horizon Interactive LLM Agents

    Kevin Chen, Marco Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, Vladlen Koltun, and Philipp Krähenbühl. Reinforcement learning for long-horizon interactive LLM agents. arXiv preprint arXiv:2502.01600, 2025

  50. [50]

    AppWorld: A controllable world of apps and people for benchmarking interactive coding agents

    Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p...

  51. [51]

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, et al. RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073, 2025

  52. [52]

    Fine-Tuning Language Models from Human Preferences

    Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019

  53. [53]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020

  54. [54]

    Training Language Models to Follow Instructions with Human Feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  55. [55]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024

  56. [56]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025

  57. [57]

    CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

    Zhihang Lin, Mingbao Lin, Yuan Xie, and Rongrong Ji. CPPO: Accelerating the training of group relative policy optimization-based reasoning models. arXiv preprint arXiv:2503.22342, 2025

  58. [58]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

  59. [59]

    ZeroSearch: Incentivize the Search Capability of LLMs Without Searching

    Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Fei Huang, and Yan Zhang. ZeroSearch: Incentivize the search capability of LLMs without searching. arXiv preprint arXiv:2505.04588, 2025

  60. [60]

    ToolRL: Reward is All Tool Learning Needs

    Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. ToolRL: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958, 2025

  61. [61]

    OTC: Optimal Tool Calls via Reinforcement Learning

    Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. OTC: Optimal tool calls via reinforcement learning. arXiv preprint arXiv:2504.14870, 2025

  62. [62]

    Natural Questions: A Benchmark for Question Answering Research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019

  63. [63]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017

  64. [64]

    When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 2022

  65. [65]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018

  66. [66]

    Constructing a Multi-Hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060, 2020

  67. [67]

    MuSiQue: Multihop Questions via Single-Hop Question Composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

  68. [68]

    Measuring and Narrowing the Compositionality Gap in Language Models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022

  69. [69]

    StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization

    Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, and Yichao Wu. StepSearch: Igniting LLMs search ability via step-wise proximal policy optimization. arXiv preprint arXiv:2505.15107, 2025

  70. [70]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

  71. [71]

    VinePPO: Unlocking RL Potential for LLM Reasoning Through Refined Credit Assignment

    Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment. arXiv preprint arXiv:2410.01679, 2024

  72. [72]

    GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning

    Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, and Yong Wang. GPG: A simple and strong reinforcement learning baseline for model reasoning. arXiv preprint arXiv:2504.02546, 2025

  73. [73]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024

  74. [74]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  75. [75]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

  76. [76]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  77. [77]

    Gym-Sokoban

    Max-Philipp B. Schrader. Gym-Sokoban. https://github.com/mpSchrader/gym-sokoban, 2018

  78. [78]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025


Showing first 80 references.