HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning
Pith reviewed 2026-06-27 13:02 UTC · model grok-4.3
The pith
HIPIF trains LLM agents to break long tasks into subgoals and fold away completed histories to preserve reasoning quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HIPIF trains the agent end-to-end to organize long-horizon execution around explicit subgoals while folding completed subgoal histories to reduce long-context interference. To stabilize subgoal-based planning and execution, it combines hierarchical reflection and subgoal-oriented process rewards to guide subgoal generation, transition, and execution, without relying on costly auxiliary models or task-specific expert trajectories.
What carries the argument
Information folding, which summarizes and removes completed subgoal histories while keeping only the information needed for ongoing global state tracking and future decisions.
If this is right
- Agents maintain better global task state tracking across extended sequences without external supervision.
- Subgoal generation and transitions become more stable through the combination of reflection and process rewards.
- Performance gains appear on standard agent benchmarks without task-specific expert trajectories or auxiliary models.
- End-to-end training directly optimizes the agent for reduced long-context interference.
Where Pith is reading between the lines
- The folding approach could be tested on non-agent LLM tasks that involve long documents or multi-step reasoning chains.
- If folding works reliably, it might allow smaller context windows to suffice for complex agent workloads.
- Future work could measure whether the learned subgoal rewards transfer across different benchmark domains.
Load-bearing premise
Folding completed subgoal histories removes only non-essential information and does not discard details required for correct future reasoning or global state tracking.
What would settle it
If agents trained with HIPIF still show degraded performance on long-horizon benchmarks where early subgoal details are required for later correct decisions, compared to unfolder baselines, the central claim would be falsified.
Figures
read the original abstract
While Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents across a wide range of tasks, their performance often degrades in multi-turn long-horizon agentic tasks. Existing methods have made progress through fine-grained credit assignment to alleviate long-horizon sparse rewards and hierarchical reinforcement learning to decompose tasks and reduce long-term dependency. However, these methods still do not directly address long-context interference, in which continuously growing histories weaken the agent's ability to track the global task state and impair subsequent reasoning and decision-making. Inspired by the way humans handle complex tasks through subgoal decomposition and completed progress summarization, we propose Hierarchical Planning and Information Folding (HIPIF) for long-horizon LLM agent learning. HIPIF trains the agent end-to-end to organize long-horizon execution around explicit subgoals while folding completed subgoal histories to reduce long-context interference. Furthermore, to stabilize subgoal-based planning and execution, HIPIF combines hierarchical reflection and subgoal-oriented process rewards to guide subgoal generation, transition, and execution, without relying on costly auxiliary models or task-specific expert trajectories. Extensive experiments on three publicly available agentic benchmarks demonstrate the validity of our method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Hierarchical Planning and Information Folding (HIPIF) for long-horizon LLM agent learning. The method trains agents end-to-end to decompose tasks into explicit subgoals, fold completed subgoal histories to mitigate long-context interference, and stabilize planning via hierarchical reflection combined with subgoal-oriented process rewards. It claims this avoids reliance on auxiliary models or expert trajectories and demonstrates validity through experiments on three public agentic benchmarks.
Significance. If the folding mechanism can be shown to preserve global state details while reducing interference, the approach would offer a practical way to scale LLM agents to longer horizons without external supervision, building on hierarchical RL ideas but addressing context management directly. The end-to-end training and benchmark results, if reproducible with ablations, would strengthen claims about stable subgoal generation and execution.
major comments (3)
- [§3.2] §3.2 (Information Folding): The central claim that folding removes only non-essential content while preserving details needed for future reasoning and global state tracking lacks a concrete mechanism (e.g., selection criteria, learned compressor, or rule-based summary) or proof that irreversible compression does not discard facts from subgoal i that affect subgoal k >> i. This is load-bearing for the long-context interference solution.
- [§4] §4 (Experiments): No ablation isolating the folding component is reported; without it, performance gains on the three benchmarks cannot be attributed to folding versus hierarchical reflection or process rewards alone, weakening the claim that folding specifically reduces interference.
- [§3.3] §3.3 (Hierarchical Reflection and Process Rewards): The process reward formulation and how it guides subgoal transition/execution without task-specific tuning is underspecified; if the rewards are derived from the same LLM, the stability claim risks circularity in the absence of external verification.
minor comments (2)
- [Abstract] Abstract and §1: The phrase 'demonstrate the validity of our method' is vague; replace with specific metrics or improvements over baselines.
- Notation: Define all acronyms (e.g., HIPIF components) on first use and ensure consistent use of 'subgoal' vs. 'sub-goal' throughout.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications where possible and committing to revisions that strengthen the manuscript without overstating our current results.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Information Folding): The central claim that folding removes only non-essential content while preserving details needed for future reasoning and global state tracking lacks a concrete mechanism (e.g., selection criteria, learned compressor, or rule-based summary) or proof that irreversible compression does not discard facts from subgoal i that affect subgoal k >> i. This is load-bearing for the long-context interference solution.
Authors: We agree that §3.2 would benefit from greater specificity. The folding operation in HIPIF is implemented as a deterministic, rule-based procedure that retains only the subgoal outcome, key state deltas reported by the environment, and a one-sentence summary of execution status; all intermediate LLM-generated thoughts and action traces are discarded. We will expand the section with pseudocode and a worked example demonstrating that global state variables required for later subgoals are explicitly preserved by the rule set. A formal proof that no critical fact is ever lost is not feasible without task-specific assumptions, so we will instead emphasize the empirical safeguards (hierarchical reflection and environment-verified rewards) that mitigate any residual risk. revision: yes
-
Referee: [§4] §4 (Experiments): No ablation isolating the folding component is reported; without it, performance gains on the three benchmarks cannot be attributed to folding versus hierarchical reflection or process rewards alone, weakening the claim that folding specifically reduces interference.
Authors: The referee correctly identifies a missing control. We will add a new ablation in the revised §4 that compares the full HIPIF model against an otherwise identical variant that retains complete subgoal histories (no folding). Results will be reported on all three benchmarks together with statistical significance tests, allowing readers to isolate the contribution of the folding mechanism. revision: yes
-
Referee: [§3.3] §3.3 (Hierarchical Reflection and Process Rewards): The process reward formulation and how it guides subgoal transition/execution without task-specific tuning is underspecified; if the rewards are derived from the same LLM, the stability claim risks circularity in the absence of external verification.
Authors: We will clarify in the revision that process rewards are computed directly from environment feedback (binary success/failure signals on each declared subgoal plus a small shaping term based on progress metrics provided by the benchmark environments). The LLM is used only for reflection and next-subgoal proposal; it does not generate the reward values. This external grounding removes the circularity concern. The exact reward equation and the fact that no task-specific tuning is applied beyond the shared environment interface will be stated explicitly in §3.3. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper introduces HIPIF as an end-to-end trained method using subgoal organization, information folding, hierarchical reflection, and process rewards, evaluated on three external public benchmarks without reliance on auxiliary models or expert trajectories. The provided abstract and description contain no equations, fitted parameters presented as predictions, self-citations that bear the central claim, or uniqueness theorems imported from prior author work. The derivation chain is self-contained against external benchmarks rather than reducing to internal redefinitions or self-referential fits.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms
Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267, 2024
2024
-
[2]
Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021
2021
-
[3]
Agent-flan: Designing data and methods of effective agent tuning for large language models
Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. Agent-flan: Designing data and methods of effective agent tuning for large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 9354–9366, 2024
2024
-
[4]
Group-in-group policy optimization for llm agent training
Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems
-
[5]
Segment policy optimization: Effective segment-level credit assignment in rl for large language models
Yiran Guo, Lijie Xu, Jie Liu, Ye Dan, and Shuang Qiu. Segment policy optimization: Effective segment-level credit assignment in rl for large language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems
-
[6]
Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large lan- guage model
Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large lan- guage model. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32779–32798, 2025
2025
-
[7]
Divide and conquer: Grounding llms as efficient decision-making agents via offline hierarchical reinforcement learning
Zican Hu, Wei Liu, Xiaoye Qu, Xiangyu Yue, Chunlin Chen, Zhi Wang, and Yu Cheng. Divide and conquer: Grounding llms as efficient decision-making agents via offline hierarchical reinforcement learning. InInternational Conference on Machine Learning, pages 24570–24590. PMLR, 2025
2025
-
[8]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
Pre-trained language models for interactive decision-making
Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, et al. Pre-trained language models for interactive decision-making. InAdvances in Neural Information Processing Systems, 2022
2022
-
[10]
Deepagent: A general reasoning agent with scalable toolsets
Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, et al. Deepagent: A general reasoning agent with scalable toolsets. InProceedings of the ACM Web Conference 2026, pages 2219–2230, 2026
2026
-
[11]
Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks.Advances in Neural Information Processing Systems, 36:23813–23825, 2023
Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj Ammanabrolu, Yejin Choi, and Xiang Ren. Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks.Advances in Neural Information Processing Systems, 36:23813–23825, 2023
2023
-
[12]
Gradient coupling: The hidden barrier to generalization in agentic reinforcement learning
Jingyu Liu, Xiaopeng Wu, Jingquan Peng, Kehan Chen, Chuan Yu, Lizhong Ding, and Yong Liu. Gradient coupling: The hidden barrier to generalization in agentic reinforcement learning. arXiv preprint arXiv:2509.23870, 2025
-
[13]
Piper: Benchmarking and prompting event reasoning boundary of llms via debiasing-distillation enhanced tuning
Zhicong Lu, Changyuan Tian, PeiguangLi PeiguangLi, Li Jin, Sirui Wang, Wei Jia, Ying Shen, and Guangluan Xu. Piper: Benchmarking and prompting event reasoning boundary of llms via debiasing-distillation enhanced tuning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28591–28613, 2025
2025
-
[14]
Zhicong Lu, Zichuan Lin, Wei Jia, Changyuan Tian, Deheng Ye, Peiguang Li, Li Jin, Nayu Liu, Guangluan Xu, and Wei Feng. Hisr: Hindsight information modulated segmental process rewards for multi-turn agentic reinforcement learning.arXiv preprint arXiv:2603.18683, 2026. 10
-
[15]
When do transformers shine in rl? decoupling memory from credit assignment.Advances in Neural Information Processing Systems, 36:50429–50452, 2023
Tianwei Ni, Michel Ma, Benjamin Eysenbach, and Pierre-Luc Bacon. When do transformers shine in rl? decoupling memory from credit assignment.Advances in Neural Information Processing Systems, 36:50429–50452, 2023
2023
-
[16]
Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, and Mingyi Hong. Hiper: Hierarchical reinforcement learning with explicit credit assignment for large language model agents.arXiv preprint arXiv:2602.16165, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[17]
Virtualhome: Simulating household activities via programs
Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. Virtualhome: Simulating household activities via programs. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 8494–8502, 2018
2018
-
[18]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023
2023
-
[19]
Vlm agents generate their own memories: Distilling experience into embodied programs of thought.Advances in Neural Information Processing Systems, 37:75942–75985, 2024
Gabriel Sarch, Lawrence Jang, Michael J Tarr, William W Cohen, Kenneth Marino, and Katerina Fragkiadaki. Vlm agents generate their own memories: Distilling experience into embodied programs of thought.Advances in Neural Information Processing Systems, 37:75942–75985, 2024
2024
-
[20]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[21]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[22]
Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023
2023
-
[23]
Alfworld: Aligning text and embodied environments for interactive learning
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. InInternational Conference on Learning Representations
-
[24]
Alfred: A benchmark for interpreting grounded instructions for everyday tasks
Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mot- taghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749, 2020
2020
-
[25]
Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. Trial and error: Exploration-based trajectory optimization for llm agents.arXiv preprint arXiv:2403.02502, 2024
-
[26]
Scaling long-horizon LLM agent via context-folding.arXiv preprint arXiv:2510.11967, 2025
Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, and Jiecao Chen. Scaling long-horizon llm agent via context-folding.arXiv preprint arXiv:2510.11967, 2025
-
[27]
Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning.Artificial intelligence, 112(1-2): 181–211, 1999
Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning.Artificial intelligence, 112(1-2): 181–211, 1999
1999
-
[28]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[29]
Attention is all you need.Advances in neural information processing systems, 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017
2017
-
[30]
V oyager: An open-ended embodied agent with large language models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research. 11
-
[31]
arXiv preprint arXiv:2505.20732 , year=
Hanlin Wang, Chak Tou Leong, Jiashuo Wang, Jian Wang, and Wenjie Li. Spa-rl: Reinforcing llm agents via stepwise progress attribution.arXiv preprint arXiv:2505.20732, 2025
-
[32]
Steca: Step-level trajectory calibration for llm agent learning
Hanlin Wang, Jian Wang, Chak Tou Leong, and Wenjie Li. Steca: Step-level trajectory calibration for llm agent learning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 11597–11614, 2025
2025
-
[33]
Scienceworld: Is your agent smarter than a 5th grader? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022
Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. Scienceworld: Is your agent smarter than a 5th grader? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022
2022
-
[34]
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning.arXiv preprint arXiv:2504.20073, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022
2022
-
[36]
A-mem: Agentic memory for LLM agents
Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for LLM agents. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=FiM0M8gcct
2026
-
[37]
Language agents with reinforcement learning for strategic play in the werewolf game
Zelai Xu, Chao Yu, Fei Fang, Yu Wang, and Yi Wu. Language agents with reinforcement learning for strategic play in the werewolf game. InInternational Conference on Machine Learning, pages 55434–55464. PMLR, 2024
2024
-
[38]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023
2023
-
[40]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023
2023
-
[41]
Agentfold: Long-horizon web agents with proactive context folding
Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Jingren Zhou, Siheng Chen, and Yong Jiang. Agentfold: Long-horizon web agents with proactive context folding. In The Fourteenth International Conference on Learning Representations, 2026. URL https: //openreview....
2026
-
[42]
A survey on the memory mechanism of large language model-based agents
Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):1–47, 2025
2025
-
[43]
Epo: Hierarchical llm agents with environment preference optimization
Qi Zhao, Haotian Fu, Chen Sun, and George Konidaris. Epo: Hierarchical llm agents with environment preference optimization. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6401–6415, 2024
2024
-
[44]
Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents
Shuai Zhen, Yanhua Yu, Ruopei Guo, Nan Cheng, and Yang Deng. Hierarchical reinforcement learning with augmented step-level transitions for llm agents.arXiv preprint arXiv:2604.05808, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[45]
Archer: Training language model agents via hierarchical multi-turn rl
Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, and Aviral Kumar. Archer: Training language model agents via hierarchical multi-turn rl. InForty-first International Conference on Machine Learning,
-
[46]
Nothing happens
Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents. InFirst Workshop on Multi-Turn Interactions in Large Language Models, . 12 A Datasets Details ALFWorld.ALFWorld [ 23] is an embodied text-based ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.