pith. machine review for the scientific record.

arxiv: 2605.02572 · v1 · submitted 2026-05-04 · 💻 cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:09 UTC · model grok-4.3

classification: 💻 cs.AI · cs.LG
keywords: large language models · long-horizon tasks · training instability · horizon reduction · generalization · agent training · exploration · credit assignment

The pith

Increasing horizon length alone destabilizes training of large language models as agents through exploration and credit assignment difficulties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the effect of task length on training large language models that act as agents by interacting with environments over many steps. The authors build sets of tasks that use exactly the same rules and logic but require different numbers of actions to complete, so that horizon length can be varied in isolation. They show that longer horizons trigger unstable training because models struggle to find useful sequences and to link early choices to later rewards. Training the models first on shorter versions of the same tasks produces stable learning, higher success rates on the original long tasks, and better performance when the models later encounter tasks of unseen lengths.
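
One plausible reading of that construction, using the paper's Sudoku environment as the example: start from a solved grid and blank out exactly d cells, so the rules, per-step action space, and reward are identical across variants and only the number of fill actions to completion (the goal distance) changes. A minimal sketch, with illustrative helper names that may differ from the authors' actual setup:

```python
import random

def base_solved_grid():
    """A fixed, valid solved 9x9 Sudoku grid (standard shift construction)."""
    return [[(3 * r + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

def make_variant(goal_distance: int, seed: int = 0):
    """Blank out exactly `goal_distance` cells of a solved grid.

    Rules, per-step action space, and reward (complete the grid) are shared
    across variants; only the number of fill actions needed to finish differs.
    """
    rng = random.Random(seed)
    solution = base_solved_grid()
    puzzle = [row[:] for row in solution]
    cells = [(r, c) for r in range(9) for c in range(9)]
    for r, c in rng.sample(cells, goal_distance):
        puzzle[r][c] = 0  # 0 marks a cell the agent still has to fill
    return puzzle, solution

# Matched variants: identical rules and reasoning, different horizons.
short_task, _ = make_variant(goal_distance=2)   # short-horizon variant
long_task, _ = make_variant(goal_distance=12)   # long-horizon variant
```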

Core claim

In controlled task constructions where agents follow identical decision rules and reasoning structures but differ only in the length of the required action sequences, increasing horizon length constitutes a training bottleneck that induces severe instability driven by exploration difficulties and credit assignment challenges. Horizon reduction during training stabilizes the optimization process, delivers better performance on long-horizon tasks, and produces stronger generalization to longer-horizon variants at inference time, a pattern termed horizon generalization.
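
A back-of-envelope illustration of the exploration half of that claim (editorial, not an analysis from the paper): under a sparse terminal reward and a roughly constant per-step success probability, the chance that a rollout ever reaches the goal, and therefore carries any learning signal back to early decisions, decays geometrically with horizon length.

```python
# Illustrative only: with sparse terminal reward and per-step success
# probability p, the chance a rollout ever reaches the goal (and so carries
# any learning signal back to early actions) shrinks geometrically with H.
def success_prob(p_step: float, horizon: int) -> float:
    return p_step ** horizon

for horizon in (2, 4, 8, 16, 32):
    p = success_prob(0.7, horizon)
    print(f"H={horizon:>2}  P(rollout succeeds) ≈ {p:.4f}  "
          f"rollouts per positive reward ≈ {1 / p:,.0f}")
```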

What carries the argument

Horizon reduction applied to matched tasks that preserve identical decision rules while varying only the length of the action sequence needed for success.
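
The figures describe the concrete mechanism as swapping atomic actions for macro actions. A minimal sketch of that idea, assuming a generic step(atomic_action) -> (obs, reward, done) environment interface rather than the paper's actual API:

```python
from typing import Any, Iterable, Tuple

class MacroActionWrapper:
    """Executes a chunk of atomic actions per decision step.

    The wrapped environment's `step(atomic_action) -> (obs, reward, done)`
    interface is an assumption for this sketch, not the paper's API.
    """

    def __init__(self, env):
        self.env = env

    def step(self, macro_action: Iterable[Any]) -> Tuple[Any, float, bool]:
        obs, total_reward, done = None, 0.0, False
        for atomic in macro_action:   # run the whole chunk between decisions
            obs, reward, done = self.env.step(atomic)
            total_reward += reward
            if done:
                break
        # One decision point (and one credit-assignment step) per chunk,
        # so the effective horizon shrinks while atomic dynamics stay fixed.
        return obs, total_reward, done
```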

If this is right

  • Training under reduced horizons produces stable convergence where direct long-horizon training fails.
  • Models trained with horizon reduction reach higher success rates when tested on the original long-horizon tasks.
  • Such models generalize their learned behavior to task variants whose horizon lengths were never encountered during training.
  • The same reduction simultaneously eases both exploration and credit-assignment problems in sequential decision tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pattern may appear in any sequential decision system where episode length can be artificially shortened while preserving the underlying rules.
  • A curriculum that gradually lengthens the horizon during training could combine the stability benefit with eventual exposure to full-length tasks (a minimal schedule is sketched after this list).
  • Reported difficulties with long-horizon agent training may often trace to horizon length rather than to model size or choice of learning algorithm.
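
A minimal sketch of the curriculum idea raised above (an editorial illustration, not the authors' recipe), borrowing the short/long goal-distance split reported for Rush Hour:

```python
# Editorial sketch of a horizon curriculum, not the authors' recipe.
# Goal-distance ranges mirror the short/long split reported for Rush Hour
# (short: 4-9, long: 10-12); the 50/50 phase split is an assumption.
def horizon_curriculum(total_steps: int):
    for step in range(total_steps):
        if step < total_steps // 2:
            yield (4, 9)     # short-horizon phase: stable exploration
        else:
            yield (4, 12)    # widen to include full-length tasks

# usage idea (sample_task is hypothetical):
# for lo, hi in horizon_curriculum(10_000):
#     task = sample_task(goal_distance_range=(lo, hi))
```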

Load-bearing premise

The constructed tasks maintain identical decision rules and reasoning structures across different horizon lengths, differing solely in the required action sequence length.

What would settle it

A set of controlled tasks in which models trained directly on long horizons achieve equal or greater stability and cross-length generalization than models first trained on reduced horizons would falsify the central claim.
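
Written as a decision rule, that falsification test could look like the sketch below; the per-seed numbers are placeholders, not results from the paper.

```python
import statistics

# Placeholder per-seed outcomes, NOT results from the paper: whether training
# collapsed, success on the original long tasks, and success on unseen horizons.
direct_long = {"collapsed": [True, True, False],
               "long_success": [0.12, 0.08, 0.31],
               "unseen_horizon": [0.05, 0.03, 0.20]}
reduced_first = {"collapsed": [False, False, False],
                 "long_success": [0.54, 0.60, 0.49],
                 "unseen_horizon": [0.41, 0.45, 0.38]}

def falsifies_central_claim(direct, reduced) -> bool:
    """True if direct long-horizon training is at least as stable and at least
    as strong on long and unseen-horizon tasks as reduced-horizon training."""
    no_less_stable = sum(direct["collapsed"]) <= sum(reduced["collapsed"])
    return no_less_stable and all(
        statistics.mean(direct[k]) >= statistics.mean(reduced[k])
        for k in ("long_success", "unseen_horizon")
    )

print(falsifies_central_claim(direct_long, reduced_first))  # False for these placeholders
```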

Figures

Figures reproduced from arXiv: 2605.02572 by Beong-woo Kwak, Furu Wei, Jinyoung Yeo, Junhee Cho, Liang Wang, Nan Yang, Sunghwan Kim, Taeyoon Kwon, Xingxing Zhang.

Figure 1. A summary of our contributions. In this work, we study the training of long-horizon LLM agents from a horizon-centric perspective and identify horizon length as a fundamental bottleneck. We show that horizon reduction stabilizes RL and strengthens the tendency toward horizon generalization on longer tasks with similar reasoning difficulty.

Figure 2. Training dynamics on different goal distances. While RL training is stable on short goal distances (L1–L2), it exhibits severe instability as the goal distance increases (L3–L4).

Figure 3. Horizon reduction improves RL on long-horizon tasks. Training and test success rate on Sudoku and Rush Hour with atomic actions versus macro actions across different goal distance regimes. Across both environments, using macro actions for horizon reduction leads to more stable and effective RL, particularly in the long goal distance setting.

Figure 4. RL stability depends on effective horizon. We compare two settings with a macro-action policy: (A) reduced effective horizon via macro actions, and (B) an artificially restored long-horizon setting by restricting execution to single atomic actions.

Figure 6. Effect of subgoal decomposition on RL. Average success rate on Sudoku across increasing goal distances.

Figure 7. Robustness of horizon reduction across diverse settings. (Left) On WebShop, horizon reduction improves both training stability and average success rate. (Middle) On Sudoku (L3–L4) with a 4B model, training collapse persists under the default horizon, while horizon reduction yields stable improvement. (Right) Under a GRPO-style optimizer, the same instability pattern emerges and is resolved by horizon reduction.

Figure 8. Horizon generalization. (Left and middle) Results on Sudoku and Rush Hour demonstrate that policies trained on limited goal distance ranges generalize effectively to unseen horizons. (Right) Success rates on Sudoku as a function of goal distance for models with different step accuracy reveal that macro-action policies consistently outperform atomic actions across horizons.

Figure 9. Horizon curriculum. On Rush Hour, we compare three training strategies: Short-only trains on 4 ≤ d(s0, g) ≤ 9, Long-only trains on 10 ≤ d(s0, g) ≤ 12, and Curriculum first trains on short horizons then continues on long horizons.

Figure 10. Training dynamics under additional experimental settings. In all panels, horizon reduction refers to training with macro actions, while default refers to training with atomic actions. Atomic action2 denotes a variant in which the policy is trained with macro actions but the environment permits only atomic actions, resulting in a longer effective horizon despite the macro-action policy.

Figure 11. Success rates and goal distance for Rush Hour. RL-short trains on 4 ≤ d(s0, g) ≤ 9, RL-long trains on 10 ≤ d(s0, g) ≤ 12, and RL-long-curriculum first trains on short horizons then continues on long horizons.

Figure 12. Evaluation of horizon and technique generalization in Sudoku. Generalization holds within seen (easy) techniques but breaks under increased technique difficulty (medium and hard). Points are jittered for visualization.

Figure 13. Effect of macro action design on frontier models. Average success rates for GPT-5-mini and Gemini-3-Flash-Preview under different action designs, including atomic actions, fixed-length macro actions (n=2, 5), and flexible macro actions (n ≤ k or unbounded). Flexible macro actions are generally beneficial across models, although their performance ceiling varies by model.

Figure 14. Prompt used for Sudoku experiments.

Figure 15. Prompt used for Rush Hour experiments.

Figure 16. Case study for our RL model in Sudoku (successful case).

Figure 17. Case study for our RL model in Sudoku (failed case).

Figure 18. Case study for our RL model in Rush Hour (successful case).

Figure 19. Case study for our RL model in Rush Hour (failed case).
read the original abstract

Large language models (LLMs) have shown promise as interactive agents that solve tasks through extended sequences of environment interactions. While prior work has primarily focused on system-level optimizations or algorithmic improvements, the role of task horizon length in shaping training dynamics remains poorly understood. In this work, we present a systematic empirical study that examines horizon length through controlled task constructions. Specifically, we construct controlled tasks in which agents face identical decision rules and reasoning structures, but differ only in the length of action sequences required for successful completion. Our results reveal that increasing horizon length alone constitutes a training bottleneck, inducing severe training instability driven by exploration difficulties and credit assignment challenges. We demonstrate that horizon reduction is a key principle to address this limitation, stabilizing training and achieving better performance in long-horizon tasks. Moreover, we find that horizon reduction is related to stronger generalization across horizon lengths: models trained under reduced horizons generalize more effectively to longer-horizon variants at inference time, a phenomenon we refer to as horizon generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper conducts an empirical study of how horizon length affects LLM training as interactive agents. Using controlled task variants that purportedly hold decision rules, state transitions, and reasoning structures fixed while varying only the required action-sequence length, it reports that longer horizons induce severe training instability from exploration and credit-assignment difficulties. It further claims that training with reduced horizons stabilizes learning, yields higher performance on long-horizon tasks, and improves generalization to longer horizons at inference time.

Significance. If the isolation of horizon length is rigorously verified and the reported effects are statistically robust, the work would supply a useful empirical principle for curriculum design in LLM-agent training. The observation that reduced-horizon training can improve long-horizon generalization is potentially actionable for practitioners and merits further investigation.

major comments (2)
  1. [Methods / Task Construction] Task-construction section (Methods): The central claim that instability is caused by horizon length alone rests on the premise that decision rules, reward functions, state-transition dynamics, and branching factors are identical across variants. The manuscript provides no explicit verification (e.g., side-by-side state-space statistics, reward histograms, or branching-factor measurements) that these quantities remain unchanged when horizon is extended. Without such checks, observed differences cannot be confidently attributed to length rather than unintended alterations in task structure.
  2. [Results] Results and evaluation sections: The abstract and summary statements assert “severe training instability” and “better performance,” yet the provided description contains no quantitative metrics (success rate, cumulative reward, variance across random seeds), statistical tests, or comparisons against standard baselines (e.g., PPO with different horizons, curriculum learning, or hierarchical methods). This absence prevents assessment of effect size and reproducibility.
minor comments (1)
  1. [Abstract] The abstract refers to “horizon generalization” as a new phenomenon; a brief literature comparison would clarify how this differs from existing curriculum-learning or transfer results in RL.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. We have addressed each major comment below and made corresponding revisions to enhance the rigor of our empirical analysis.

read point-by-point responses
  1. Referee: [Methods / Task Construction] Task-construction section (Methods): The central claim that instability is caused by horizon length alone rests on the premise that decision rules, reward functions, state-transition dynamics, and branching factors are identical across variants. The manuscript provides no explicit verification (e.g., side-by-side state-space statistics, reward histograms, or branching-factor measurements) that these quantities remain unchanged when horizon is extended. Without such checks, observed differences cannot be confidently attributed to length rather than unintended alterations in task structure.

    Authors: We thank the referee for highlighting this important aspect of our task construction. While the manuscript describes the controlled task variants in detail, we agree that additional explicit verification would enhance confidence in attributing effects solely to horizon length. In the revised version, we have added a dedicated paragraph and accompanying table in the Methods section that provides side-by-side statistics for state-space sizes, branching factors, reward histograms, and transition dynamics across all horizon variants. These metrics confirm that the variants differ only in the required action sequence length, with identical decision rules and dynamics. We have also made the task generation code publicly available for reproducibility. revision: yes

  2. Referee: [Results] Results and evaluation sections: The abstract and summary statements assert “severe training instability” and “better performance,” yet the provided description contains no quantitative metrics (success rate, cumulative reward, variance across random seeds), statistical tests, or comparisons against standard baselines (e.g., PPO with different horizons, curriculum learning, or hierarchical methods). This absence prevents assessment of effect size and reproducibility.

    Authors: We appreciate this feedback on the presentation of our results. The full manuscript includes figures illustrating training curves and performance, but we acknowledge the need for more quantitative reporting. In the revision, we have added a results table summarizing success rates, average cumulative rewards, and standard deviations over multiple random seeds for each condition. We also include p-values from statistical tests comparing the reduced-horizon training to full-horizon baselines. Additionally, we have incorporated comparisons to standard methods such as PPO with fixed horizons and a curriculum learning baseline, showing that our approach yields superior stability and generalization. These changes provide clearer effect sizes and support reproducibility. revision: yes
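
As an editorial aside, one minimal form such a seed-level comparison could take is an exact permutation test on per-seed final success rates; the arrays below are placeholders, not the paper's data.

```python
import itertools
import statistics

# Placeholder per-seed final success rates, not the paper's data.
reduced_horizon = [0.58, 0.61, 0.55, 0.63, 0.57]
full_horizon = [0.21, 0.05, 0.34, 0.12, 0.18]

def permutation_p_value(a, b):
    """One-sided exact permutation test on the difference of means."""
    observed = statistics.mean(a) - statistics.mean(b)
    pooled = a + b
    hits = total = 0
    for idx in itertools.combinations(range(len(pooled)), len(a)):
        grp_a = [pooled[i] for i in idx]
        grp_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        if statistics.mean(grp_a) - statistics.mean(grp_b) >= observed:
            hits += 1
        total += 1
    return hits / total

print(f"p = {permutation_p_value(reduced_horizon, full_horizon):.3f}")
```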

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivations or fitted quantities

full rationale

The paper is an empirical investigation that reports results from controlled experiments on task variants. It contains no equations, no parameter fitting, no derivations, and no load-bearing self-citations that reduce claims to prior fitted inputs or self-defined quantities. The central observation—that longer horizons induce instability—is presented as a direct experimental outcome rather than a mathematical reduction. The task-construction premise (identical decision rules, differing only in sequence length) is an experimental design choice whose validity is external to any derivation chain inside the paper. No step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the controlled task constructions and standard assumptions of LLM fine-tuning and reinforcement learning; no new entities or free parameters are introduced in the reported findings.

axioms (1)
  • domain assumption: The constructed tasks maintain identical decision rules and reasoning structures while varying only sequence length.
    Invoked in the abstract to isolate horizon length as the sole variable.

pith-pipeline@v0.9.0 · 5501 in / 1206 out tokens · 51596 ms · 2026-05-08T18:09:17.751835+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

96 extracted references · 57 canonical work pages · 21 internal anchors
