pith. machine review for the scientific record. sign in

arxiv: 2605.05413 · v1 · submitted 2026-05-06 · 💻 cs.AI

Recognition: unknown

From History to State: Constant-Context Skill Learning for LLM Agents

Feng Ju, Haoyang Xie, Puda Zhao, Xinyuan Wang, Yancheng Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-08 16:45 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agentsconstant contextskill learningtask-family modulesstate trackingALFWorldWebShopSciWorld
0
0 comments X

The pith

LLM agents replace growing prompt histories with compact state blocks and learned task-family modules to cut context length while matching strong benchmark results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that recurring agent workflows can be partitioned into reusable task-family modules whose procedural knowledge is absorbed into model weights. At inference time the model conditions only on the current observation plus a small deterministic state block that tracks progress and supplies subgoal rewards, enabling step-level supervised fine-tuning followed by online reinforcement learning. This yields 89.6% unseen success on ALFWorld, 76.8% on WebShop, and 66.4% unseen success on SciWorld with Qwen3-8B, matching or exceeding prior agent-training results while using 2-7 times fewer prompt tokens per turn than ReAct baselines. A sympathetic reader would care because the approach simultaneously improves privacy, reduces cost, and maintains capability for personal assistants that operate browsers, files, and tools.

Core claim

Reusable procedures for agent tasks are learned into lightweight task-family modules; inference uses only the current observation and a compact state block produced by a deterministic tracker that also supplies aligned subgoal rewards, so each module trains via step-level SFT and is refined by online RL. With this constant-context regime, Qwen3-8B reaches 89.6% unseen success on ALFWorld, 76.8% success on WebShop, and 66.4% unseen success on SciWorld while cutting prompt tokens per turn by 2-7x relative to controlled ReAct prompting.

What carries the argument

Constant-context skill learning via task-family modules conditioned on current observation and a compact state block rendered by a deterministic progress tracker.

If this is right

  • Agents maintain high task success without accumulating full interaction histories in every prompt.
  • SFT followed by RL using tracker-supplied subgoal rewards refines each module for the target task family.
  • The same approach works across base models including Qwen3-4B, Qwen3-8B, and Llama-3.1-8B.
  • Token usage per turn drops by factors of 2-7 relative to standard history-based prompting on the tested benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Local deployment becomes more attractive because far less intermediate context leaves the device.
  • Many everyday agent tasks appear to possess modular structure that can be pre-learned once and reused.
  • The method may scale to domains where workflows are harder to partition if richer state representations are developed.

Load-bearing premise

Recurring agent workflows can be cleanly partitioned into task-family modules whose compact state block captures all necessary history and progress information without loss of critical context.

What would settle it

Replace the state block with an empty or randomly generated block on the same ALFWorld, WebShop, and SciWorld tasks and measure whether success rates fall substantially below the reported 89.6%, 76.8%, and 66.4% figures.

Figures

Figures reproduced from arXiv: 2605.05413 by Feng Ju, Haoyang Xie, Puda Zhao, Xinyuan Wang, Yancheng Wang.

Figure 1
Figure 1. Figure 1: Context-to-weights skill learning pipeline. view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative ALFWorld household trace with rendered state blocks. view at source ↗
Figure 3
Figure 3. Figure 3: Reward ablation. Using only terminal task reward during RL drops success by 4.6 points, showing that subgoal rewards provide useful inter￾mediate feedback view at source ↗
Figure 4
Figure 4. Figure 4: WebShop SFT performance versus expert trajectory count. view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative WebShop shopping trace with rendered state blocks. view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative SciWorld lab-procedure trace with rendered state blocks. view at source ↗
read the original abstract

Large language model (LLM) agents are increasingly used to operate browsers, files, code and tools, making personal assistants a natural deployment target. Yet personal agents face a privacy-cost-capability tension: cloud models execute multi-step workflows well but expose sensitive intermediate context to external APIs, while local models preserve privacy but remain less reliable. Both settings also pay repeatedly for long skill prompts and growing histories. We propose constant-context skill learning, a context-to-weights framework for recurring agent workflows: reusable procedures are learned in lightweight task-family modules, while inference conditions only on the current observation and a compact state block. A deterministic tracker renders this state block from task progress and supplies aligned subgoal rewards, so each module can be trained with step-level SFT and refined through online RL. Across ALFWorld, WebShop, and SciWorld, our agents achieve strong performance across Qwen3-4B, Qwen3-8B and Llama-3.1-8B. With Qwen3-8B, SFT+RL reaches 89.6\% unseen success on ALFWorld, 76.8\% success on WebShop, and 66.4\% unseen success on SciWorld. They match or exceed strong published agent-training results while reducing prompt tokens per turn by 2--7$\times$ relative to controlled ReAct prompting baselines, showing that procedural context can be moved from prompts into weights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes constant-context skill learning for LLM agents: recurring workflows are partitioned into lightweight task-family modules whose procedural history is moved into weights via SFT+RL training; at inference only the current observation and a compact state block (produced by a deterministic tracker that also supplies subgoal rewards) are provided. On ALFWorld, WebShop and SciWorld the method reports 89.6% unseen success (Qwen3-8B), 76.8% success and 66.4% unseen success respectively, matching or exceeding prior agent-training results while cutting prompt tokens per turn by 2–7× versus controlled ReAct baselines.

Significance. If the central claim holds, the work offers a practical route to privacy-preserving and token-efficient local agents by embedding reusable procedures in model weights rather than long prompts. The reported numbers on three distinct environments and multiple base models constitute a concrete empirical contribution, though their interpretability depends on the missing controls noted below.

major comments (2)
  1. [Methods (tracker and state-block construction)] The central claim that the deterministic tracker’s compact state block is informationally complete for unseen tasks is load-bearing, yet no ablations are described that remove or corrupt individual elements (subgoal progress flags, observation summaries, etc.) and measure the resulting drop in success rate or change in failure modes on held-out tasks. Without such evidence it remains possible that the 89.6% ALFWorld figure relies on environment-specific tracker engineering rather than a general constant-context mechanism.
  2. [Results / Experiments] The abstract and results sections report concrete success rates and token reductions but supply no details on baselines, variance across seeds, data splits, or exact training procedures (hyper-parameters, number of RL steps, reward shaping). This absence prevents verification that the numbers support the stated claims.
minor comments (1)
  1. [Results] The token-reduction claim (2–7×) would be clearer if accompanied by per-environment tables showing exact token counts for the ReAct baselines versus the proposed method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for identifying areas where additional evidence and transparency would strengthen the manuscript. We respond to each major comment below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: The central claim that the deterministic tracker’s compact state block is informationally complete for unseen tasks is load-bearing, yet no ablations are described that remove or corrupt individual elements (subgoal progress flags, observation summaries, etc.) and measure the resulting drop in success rate or change in failure modes on held-out tasks. Without such evidence it remains possible that the 89.6% ALFWorld figure relies on environment-specific tracker engineering rather than a general constant-context mechanism.

    Authors: We appreciate the referee highlighting the need to directly validate informational completeness. The tracker is deterministic by construction and encodes only task progress and subgoal information that is required for the current step, with no environment-specific heuristics beyond the task definition itself. The fact that the same framework yields strong unseen-task performance across three dissimilar environments (ALFWorld, WebShop, SciWorld) already indicates that the state block is not narrowly engineered. Nevertheless, we agree that targeted ablations would provide clearer evidence. In the revised manuscript we will add a dedicated ablation study that removes or perturbs individual state-block fields (subgoal flags, observation summaries, etc.) and reports the resulting success-rate drops and failure-mode shifts on held-out tasks. This will isolate the contribution of each element and further support the generality of the constant-context approach. revision: yes

  2. Referee: The abstract and results sections report concrete success rates and token reductions but supply no details on baselines, variance across seeds, data splits, or exact training procedures (hyper-parameters, number of RL steps, reward shaping). This absence prevents verification that the numbers support the stated claims.

    Authors: We agree that reproducibility requires explicit experimental details. While the manuscript already identifies the baselines as controlled ReAct prompting and describes the overall training pipeline (SFT followed by online RL with tracker-supplied rewards), we acknowledge that variance, split definitions, and hyper-parameter values are not reported at the level needed for verification. In the revision we will expand the Experiments and Results sections with: (i) success rates reported as mean ± standard deviation over multiple random seeds, (ii) precise descriptions of the seen/unseen data splits used for each environment, and (iii) a table listing all hyper-parameters, the number of RL steps, batch sizes, and the exact reward-shaping functions employed by the deterministic tracker. These additions will enable readers to fully reproduce and assess the reported success rates and token reductions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training recipe with no derivations or self-referential reductions.

full rationale

The paper describes an empirical method for constant-context skill learning via task-family modules, deterministic trackers for state blocks, SFT, and RL. No equations, closed-form derivations, or load-bearing self-citations are present that reduce predictions to fitted inputs or prior author results by construction. Results are benchmarked externally on ALFWorld, WebShop, and SciWorld against published baselines, satisfying the self-contained criterion for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so free parameters, axioms, and invented entities cannot be exhaustively audited; the framework implicitly assumes recurring workflows exist and that a compact state suffices.

pith-pipeline@v0.9.0 · 5567 in / 1110 out tokens · 59703 ms · 2026-05-08T16:45:43.733804+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 17 canonical work pages · 8 internal anchors

  1. [1]

    Agent skills

    Anthropic. Agent skills. https://platform.claude.com/docs/en/agents-and-tools/ agent-skills/overview, 2026. Accessed: 2026-05-04

  2. [2]

    Claude opus 4.7

    Anthropic. Claude opus 4.7. https://www.anthropic.com/news/claude-opus-4-7 , 2026. Accessed: 2026-05-01

  3. [3]

    Lightmem: Lightweight and efficient memory-augmented generation.arXiv preprint arXiv:2510.18866, 2025

    Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, et al. Lightmem: Lightweight and efficient memory-augmented generation. arXiv preprint arXiv:2510.18866, 2025

  4. [4]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978, 2025

  5. [5]

    Agentrefine: Enhancing agent generalization through refinement tuning.arXiv preprint arXiv:2501.01702, 2025

    Dayuan Fu, Keqing He, Yejie Wang, Wentao Hong, Zhuoma Gongque, Weihao Zeng, Wei Wang, Jingang Wang, Xunliang Cai, and Weiran Xu. Agentrefine: Enhancing agent generalization through refinement tuning.arXiv preprint arXiv:2501.01702, 2025

  6. [6]

    Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. Parameter-efficient fine-tuning for large models: A comprehensive survey.arXiv preprint arXiv:2403.14608, 2024

  7. [7]

    Word-based dialog state tracking with recurrent neural networks

    Matthew Henderson, Blaise Thomson, and Steve Young. Word-based dialog state tracking with recurrent neural networks. InProceedings of the 15th annual meeting of the special interest group on discourse and dialogue (SIGDIAL), pages 292–299, 2014

  8. [8]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Ges- mundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. InInternational conference on machine learning, pages 2790–2799. PMLR, 2019

  9. [9]

    Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  10. [10]

    Llmlingua: Compressing prompts for accelerated inference of large language models

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 13358–13376, 2023

  11. [11]

    Hierarchical deep reinforce- ment learning: Integrating temporal abstraction and intrinsic motivation.Advances in neural information processing systems, 29, 2016

    Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforce- ment learning: Integrating temporal abstraction and intrinsic motivation.Advances in neural information processing systems, 29, 2016

  12. [12]

    Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

    Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan, and Rema Padman. Beyond single-turn: A survey on multi-turn interactions with large language models.arXiv preprint arXiv:2504.04717, 2025

  13. [13]

    Sage: Self-evolving agents with reflective and memory-augmented abilities.Neurocomputing, 647:130470, 2025

    Xuechen Liang, Meiling Tao, Yinghui Xia, Jianhui Wang, Kun Li, Yijin Wang, Yangfan He, Jingsong Yang, Tianyu Shi, Yuantao Wang, et al. Sage: Self-evolving agents with reflective and memory-augmented abilities.Neurocomputing, 647:130470, 2025

  14. [14]

    Prompt compression with context-aware sentence encoding for fast and improved llm inference

    Barys Liskavets, Maxim Ushakov, Shuvendu Roy, Mark Klibanov, Ali Etemad, and Shane K Luke. Prompt compression with context-aware sentence encoding for fast and improved llm inference. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24595–24604, 2025

  15. [15]

    Large language model-based planning agent with generative memory strengthens performance in textualized world.Engineering Applications of Artificial Intelligence, 148:110319, 2025

    Junyang Liu, Wenning Hao, Kai Cheng, and Dawei Jin. Large language model-based planning agent with generative memory strengthens performance in textualized world.Engineering Applications of Artificial Intelligence, 148:110319, 2025

  16. [16]

    Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

  17. [17]

    Odyssey: Empowering minecraft agents with open-world skills.arXiv preprint arXiv:2407.15325, 2024

    Shunyu Liu, Yaoru Li, Kongcheng Zhang, Zhenyu Cui, Wenkai Fang, Yuxuan Zheng, Tongya Zheng, and Mingli Song. Odyssey: Empowering minecraft agents with open-world skills.arXiv preprint arXiv:2407.15325, 2024

  18. [18]

    Agentic critical training.arXiv preprint arXiv:2603.08706, 2026

    Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, and Furong Huang. Agentic critical training.arXiv preprint arXiv:2603.08706, 2026

  19. [19]

    Exploratory memory-augmented llm agent via hybrid on-and off-policy optimization.arXiv preprint arXiv:2602.23008, 2026

    Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, and Yuqing Yang. Exploratory memory- augmented llm agent via hybrid on-and off-policy optimization.arXiv preprint arXiv:2602.23008, 2026. 10

  20. [20]

    Learning to compress prompts with gist tokens.Advances in Neural Information Processing Systems, 36:19327–19352, 2023

    Jesse Mu, Xiang Li, and Noah Goodman. Learning to compress prompts with gist tokens.Advances in Neural Information Processing Systems, 36:19327–19352, 2023

  21. [21]

    Policy invariance under reward transformations: Theory and application to reward shaping

    Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. InIcml, volume 99, pages 278–287. Citeseer, 1999

  22. [22]

    Introducing gpt-5.5

    OpenAI. Introducing gpt-5.5. https://openai.com/index/introducing-gpt-5-5/ , 2026. Accessed: 2026-05-01

  23. [23]

    Openclaw — personal ai assistant

    OpenClaw. Openclaw — personal ai assistant. https://openclaw.ai/, 2026. Accessed: 2026-05- 01

  24. [24]

    Gorilla: Large language model connected with massive apis.Advances in Neural Information Processing Systems, 37:126544–126565, 2024

    Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis.Advances in Neural Information Processing Systems, 37:126544–126565, 2024

  25. [25]

    Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539–68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539–68551, 2023

  26. [26]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  27. [27]

    Direct multi-turn preference optimization for language agents

    Wentao Shi, Mengqi Yuan, Junkang Wu, Qifan Wang, and Fuli Feng. Direct multi-turn preference optimization for language agents. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2312–2324, 2024

  28. [28]

    Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  29. [29]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768, 2020

  30. [30]

    Remem: Reasoning with episodic memory in language agent.arXiv preprint arXiv:2602.13530, 2026

    Yiheng Shu, Saisri Padmaja Jonnalagedda, Xiang Gao, Bernal Jiménez Gutiérrez, Weijian Qi, Kamalika Das, Huan Sun, and Yu Su. Remem: Reasoning with episodic memory in language agent.arXiv preprint arXiv:2602.13530, 2026

  31. [31]

    Trial and error: Exploration- based trajectory optimization of llm agents

    Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. Trial and error: Exploration- based trajectory optimization of llm agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7584–7600, 2024

  32. [32]

    Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning.Artificial intelligence, 112(1-2):181–211, 1999

    Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning.Artificial intelligence, 112(1-2):181–211, 1999

  33. [33]

    Qwen Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

  34. [34]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291, 2023

  35. [35]

    Scienceworld: Is your agent smarter than a 5th grader? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022

    Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. Scienceworld: Is your agent smarter than a 5th grader? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022

  36. [36]

    Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

    Lingling Xu, Haoran Xie, S Joe Qin, Xiaohui Tao, and Fu Lee Wang. Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

  37. [37]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.arXiv preprint arXiv:2502.12110, 2025

  38. [38]

    Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024. 11

  39. [39]

    Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35: 20744–20757, 2022

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35: 20744–20757, 2022

  40. [40]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

  41. [41]

    Memgen: Weaving generative latent memory for self-evolving agents.arXiv preprint arXiv:2509.24704, 2025

    Guibin Zhang, Muxin Fu, and Shuicheng Yan. Memgen: Weaving generative latent memory for self- evolving agents.arXiv preprint arXiv:2509.24704, 2025

  42. [42]

    Agent learning via early experience.arXiv preprint arXiv:2510.08558, 2025

    Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, et al. Agent learning via early experience.arXiv preprint arXiv:2510.08558, 2025

  43. [43]

    A probabilistic end-to-end task-oriented dialog model with latent belief states towards semi-supervised learning

    Yichi Zhang, Zhijian Ou, Min Hu, and Junlan Feng. A probabilistic end-to-end task-oriented dialog model with latent belief states towards semi-supervised learning. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9207–9219, 2020

  44. [44]

    A survey on the memory mechanism of large language model-based agents.ACM Transactions on Information Systems, 43(6):1–47, 2025

    Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji- Rong Wen. A survey on the memory mechanism of large language model-based agents.ACM Transactions on Information Systems, 43(6):1–47, 2025

  45. [45]

    arXiv preprint arXiv:2402.19446 , year=

    Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, and Aviral Kumar. Archer: Training language model agents via hierarchical multi-turn rl.arXiv preprint arXiv:2402.19446, 2024. 12 7 Technical appendices and supplementary material 7.1 Training Details and Hyperparameters All three benchmarks provide expert trajectories, which we use directly to constru...