Multi²: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments
Pith reviewed 2026-06-28 11:28 UTC · model grok-4.3
The pith
Hierarchical separation of high-level SFT planning and low-level RL execution lets LLM agents maintain stable goals across long interactions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multi² decomposes agent behavior into a high-level System 1 agent that generates context-aware sub-goals through supervised fine-tuning and a low-level System 2 agent that executes atomic actions through offline-to-online reinforcement learning. This explicit role separation produces stable long-horizon control and reduces objective drift without requiring additional inter-agent feedback mechanisms.
What carries the argument
The hierarchical multi-agent framework that assigns sub-goal generation to an SFT-trained high-level agent and action execution to an RL-trained low-level agent.
If this is right
- Long-horizon tasks become more reliable because the high-level agent can reset or adjust sub-goals periodically.
- Training remains efficient since the SFT and RL components can be developed separately.
- New hierarchical benchmark datasets enable standardized evaluation of similar multi-level agent designs.
- Adaptation to new environments improves because only the low-level policy needs online updates.
Where Pith is reading between the lines
- Similar hierarchies might reduce drift in other sequential decision domains like robotics or dialogue systems.
- Testing coordination under noisy high-level outputs would reveal how robust the separation really is.
- The released benchmarks could support comparisons with single-agent methods that incorporate explicit memory.
- If the low-level agent is replaced with a different controller, the high-level planning might still transfer.
Load-bearing premise
The high-level and low-level agents can coordinate effectively on the global objective even when trained mostly independently and without built-in error recovery or feedback loops between them.
What would settle it
In a controlled test, if the low-level agent frequently fails to achieve the sub-goals set by the high-level agent, or if overall performance does not exceed non-hierarchical baselines on long-horizon tasks.
Figures
read the original abstract
A central goal of large language model (LLM) research is to build agentic systems that can plan, act, and adapt through sustained interaction with dynamic environments. While recent LLM-based agents exhibit impressive contextual reasoning, their long-horizon decision-making remains fragile, often suffering from objective drift, where goals and plans drift over extended interactions. We introduce Multi$^2$, a hierarchical multi-agent decision-making framework that explicitly decomposes agent behavior into complementary roles. A high-level agent (System 1) focuses on context-aware sub-goal generation using supervised fine-tuning (SFT), while a low-level agent (System 2) executes atomic actions through offline-to-online reinforcement learning (RL) in interactive environments. This separation enables stable long-horizon control, mitigates objective drift, and allows efficient adaptation. Across diverse interactive environments, Multi$^2$ consistently outperforms strong agentic baselines, demonstrating improved robustness and coordination in multi-turn interaction. Beyond performance, we introduce and release three hierarchical benchmark datasets, filling a long-standing gap in training and evaluating hierarchical decision-making for LLM-based agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Multi², a hierarchical multi-agent framework for LLM-based agents in interactive environments. A high-level System 1 agent uses supervised fine-tuning (SFT) for context-aware sub-goal generation, while a low-level System 2 agent uses offline-to-online RL for atomic action execution. The separation is claimed to enable stable long-horizon control, mitigate objective drift, and support efficient adaptation. The work reports consistent outperformance over strong agentic baselines across diverse environments and releases three new hierarchical benchmark datasets for training and evaluating such agents.
Significance. If the empirical results hold with proper controls and statistical support, the hierarchical decomposition could offer a practical route to more robust long-horizon LLM agents by separating planning from execution. The release of three benchmark datasets is a concrete contribution that addresses a documented gap in hierarchical decision-making evaluation.
major comments (2)
- [Abstract (framework paragraph)] Abstract, paragraph describing the framework: the central claim that the SFT/RL separation 'enables stable long-horizon control, mitigates objective drift' is load-bearing, yet the provided description states the agents 'can be trained largely independently' without specifying inter-agent signaling, replanning triggers, error-detection mechanisms, or closed-loop feedback. If low-level execution deviates due to stochasticity or sub-goal ambiguity, no pathway for high-level correction is described; this directly affects the drift-mitigation guarantee.
- [Abstract] Abstract: the assertion of 'consistent outperformance' and 'improved robustness' across environments is presented without any metrics, baseline descriptions, statistical tests, or experimental details in the available text. The soundness of the empirical claim therefore cannot be assessed from the supplied material.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate whether revisions will be made.
read point-by-point responses
-
Referee: [Abstract (framework paragraph)] Abstract, paragraph describing the framework: the central claim that the SFT/RL separation 'enables stable long-horizon control, mitigates objective drift' is load-bearing, yet the provided description states the agents 'can be trained largely independently' without specifying inter-agent signaling, replanning triggers, error-detection mechanisms, or closed-loop feedback. If low-level execution deviates due to stochasticity or sub-goal ambiguity, no pathway for high-level correction is described; this directly affects the drift-mitigation guarantee.
Authors: Section 3 of the manuscript details the interaction protocol, including a shared context buffer for sub-goal communication, execution-status feedback that triggers high-level replanning, and reward-based error detection for closed-loop correction. The abstract's brevity omitted these elements. We will revise the abstract to briefly reference the closed-loop feedback mechanism, thereby strengthening support for the drift-mitigation claim. revision: yes
-
Referee: [Abstract] Abstract: the assertion of 'consistent outperformance' and 'improved robustness' across environments is presented without any metrics, baseline descriptions, statistical tests, or experimental details in the available text. The soundness of the empirical claim therefore cannot be assessed from the supplied material.
Authors: Abstracts conventionally summarize results without full metrics or statistics; these are reported in Section 4, which includes performance tables, baseline descriptions, and statistical significance tests across environments. The claims are substantiated in the body of the paper. No revision to the abstract is required. revision: no
Circularity Check
No circularity; empirical framework without derivation chain
full rationale
The paper introduces an empirical hierarchical multi-agent framework separating high-level SFT for sub-goal generation from low-level RL for action execution. No equations, derivations, or mathematical reductions are present that would equate claimed outcomes (e.g., drift mitigation) to fitted parameters or self-defined quantities within the paper. Performance claims rest on experimental comparisons across environments rather than any self-referential derivation. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the abstract or described structure. The design is presented as an architectural choice validated empirically, with no reduction of predictions to inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
En- hancing decision-making of large language models via actor-critic
Heng Dong, Kefei Duan, and Chongjie Zhang. En- hancing decision-making of large language models via actor-critic. InInternational Conference on Machine Learning (ICML), 2025
2025
-
[2]
CollabLLM: From passive responders to active collaborators
Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, and Jianfeng Gao. CollabLLM: From passive responders to active collaborators. InInter- national Conference on Machine Learning (ICML), 2025
2025
-
[3]
DAMA: Data- and model-aware align- ment of multi-modal LLMs
Jinda Lu, Junkang Wu, Jinghan Li, Xiaojun Jia, Shuo Wang, YiFan Zhang, Junfeng Fang, Xiang Wang, and Xiangnan He. DAMA: Data- and model-aware align- ment of multi-modal LLMs. InInternational Confer- ence on Machine Learning (ICML), 2025
2025
-
[4]
Inverse rational control with par- tially observable continuous nonlinear dynamics
Minhae Kwon, Saurabh Daptardar, Paul R Schrater, and Xaq Pitkow. Inverse rational control with par- tially observable continuous nonlinear dynamics. In Advances in Neural Information Processing Systems (NeurIPS), 2020
2020
-
[5]
QuBE: Question-based belief enhancement for agentic LLM reasoning
Minsoo Kim, Jongyoon Kim, Jihyuk Kim, and Seung Hwang. QuBE: Question-based belief enhancement for agentic LLM reasoning. InEmpirical Methods in Natural Language Processing (EMNLP), 2024
2024
-
[6]
Agentic reasoning: A streamlined frame- work for enhancing LLM reasoning with agentic tools
Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin. Agentic reasoning: A streamlined frame- work for enhancing LLM reasoning with agentic tools. InAssociation for Computational Linguistics (ACL), 2025
2025
-
[7]
T1: Advancing language model reasoning through reinforcement learning and inference scaling
Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, and Yuxiao Dong. T1: Advancing language model reasoning through reinforcement learning and inference scaling. InIn- ternational Conference on Machine Learning (ICML), 2025
2025
-
[8]
Episodic future think- ing mechanism for multi-agent reinforcement learning
Dongsu Lee and Minhae Kwon. Episodic future think- ing mechanism for multi-agent reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[9]
ReCAP: Recursive context-aware reasoning and planning for large language model agents
Zhenyu Zhang, Tianyi Chen, Weiran Xu, Alex Pent- land, and Jiaxin Pei. ReCAP: Recursive context-aware reasoning and planning for large language model agents. InAdvances in Neural Information Processing Systems (NeurIPS), 2025
2025
-
[10]
Evaluating llm-based agents for multi-turn conversations: A survey, 2025
Shengyue Guan, Jindong Wang, Jiang Bian, Bin Zhu, Jian Lou, and Haoyi Xiong. Evaluating LLM-based agents for multi-turn conversations: A survey.arXiv preprint arXiv:2503.22458, 2025
-
[11]
Path drift in large reasoning models: How first-person commitments override safety
Yuyi Huang, Runzhe Zhan, Lidia Chao, Ailin Tao, and Derek Wong. Path drift in large reasoning models: How first-person commitments override safety. In Empirical Methods in Natural Language Processing (EMNLP), 2025
2025
-
[12]
Drift no more? Context equilibria in multi-turn LLM interac- tions
Vardhan Dongre, Ryan Rossi, Viet Lai, Seunghyun Yoon, Dilek Hakkani-Tür, and Trung Bui. Drift no more? Context equilibria in multi-turn LLM interac- tions. InAAAI Personalization in the Era of Large Foundation Models Workshop, 2025
2025
-
[13]
Do as I can, not as I say: Grounding language in robotic affordances
Brian Ichter, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Sergey Levine, Yao Lu, Carolina Parada, Kanishka Rao, Pierre Sermanet, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Mengyuan Yan, Noah Brown, Michael Ahn, Omar Co...
2023
-
[14]
ReAct: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representa- tions (ICLR), 2023
2023
-
[15]
Divide and conquer: Grounding LLMs as efficient decision-making agents via offline hierarchical reinforcement learning
Zican Hu, Wei Liu, Xiaoye Qu, Xiangyu Yue, Chunlin Chen, Zhi Wang, and Yu Cheng. Divide and conquer: Grounding LLMs as efficient decision-making agents via offline hierarchical reinforcement learning. InIn- ternational Conference on Machine Learning (ICML), 2025. 10 Multi2: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments
2025
-
[16]
ADaPT: As-needed decomposition and planning with language models
Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. ADaPT: As-needed decomposition and planning with language models. InFindings of the Association for Computational Linguistics (NAACL), 2024
2024
-
[17]
Plan-and-Act: Improving planning of agents for long-horizon tasks
Lutfi Erdogan, Hiroki Furuta, Sehoon Kim, Nicholas Lee, Suhong Moon, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. Plan-and-Act: Improving planning of agents for long-horizon tasks. InInter- national Conference on Machine Learning (ICML), 2025
2025
-
[18]
The illusion of diminishing returns: Measuring long horizon execution in LLMs
Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, and Jonas Geiping. The illusion of diminishing returns: Measuring long horizon execution in LLMs. InInternational Conference on Learning Representa- tions (ICLR), 2026
2026
-
[19]
Agent-oriented planning in multi-agent systems
Ao Li, Yuexiang Xie, Songze Li, Fugee Tsung, Bolin Ding, and Yaliang Li. Agent-oriented planning in multi-agent systems. InInternational Conference on Learning Representations (ICLR), 2025
2025
-
[20]
Multi-agent collaboration via evolv- ing orchestration
Yufan Dang, Chen Qian, Xueheng Luo, Jingru Fan, Zi- hao Xie, Ruijie Shi, Weize Chen, Cheng Yang, Xiaoyin Che, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, and Maosong Sun. Multi-agent collaboration via evolv- ing orchestration. InAdvances in Neural Information Processing Systems (NeurIPS), 2025
2025
-
[21]
Agentic AI: The age of reason- ing—A review.Journal of Automation and Intelli- gence, 2025
Ume Nisa, Muhammad Shirazi, Mohamed Saip, and Muhammad Pozi. Agentic AI: The age of reason- ing—A review.Journal of Automation and Intelli- gence, 2025
2025
-
[22]
ScienceWorld: Is your agent smarter than a 5th grader? InEmpirical Methods in Natural Language Processing (EMNLP), 2022
Ruoyao Wang, Peter Jansen, Marc Côté, and Prithviraj Ammanabrolu. ScienceWorld: Is your agent smarter than a 5th grader? InEmpirical Methods in Natural Language Processing (EMNLP), 2022
2022
-
[23]
OASIS: Open-world adaptive self-supervised and imbalanced- aware system
Miru Kim, Mugon Joe, and Minhae Kwon. OASIS: Open-world adaptive self-supervised and imbalanced- aware system. InACM International Conference on Information and Knowledge Management (CIKM), 2025
2025
-
[24]
Improving network attack classification on imbalanced real-world intrusion incident datasets
Miru Kim, Mugon Joe, and Minhae Kwon. Improving network attack classification on imbalanced real-world intrusion incident datasets. InInternational Confer- ence on Mobile Systems, Applications and Services (MobiSys), 2025
2025
-
[25]
Con- trastive learning based network attack classifier for imbalanced data.Journal of Communications and Networks, 28(1):86–97, Feb
Mugon Joe, Miru Kim, and Minhae Kwon. Con- trastive learning based network attack classifier for imbalanced data.Journal of Communications and Networks, 28(1):86–97, Feb. 2026
2026
-
[26]
Per- sonalized split federated learning with early exit: Pre- training and online learning against label shifts.IEEE Internet of Things Journal, 12(22):47069–47082, Nov
Miru Kim, Heewon Park, and Minhae Kwon. Per- sonalized split federated learning with early exit: Pre- training and online learning against label shifts.IEEE Internet of Things Journal, 12(22):47069–47082, Nov. 2025
2025
-
[27]
Fed-ADE: Adaptive learning rate for federated post-adaptation under distribution shift
Heewon Park, Mugon Joe, Miru Kim, Kyungjin Im, and Minhae Kwon. Fed-ADE: Adaptive learning rate for federated post-adaptation under distribution shift. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
2026
-
[28]
Evolving intelligent network attack classifier under label distri- bution shift.IEEE Transactions on Network Science and Engineering, 13(1):7448–7464, Mar
Miru Kim, Mugon Joe, and Minhae Kwon. Evolving intelligent network attack classifier under label distri- bution shift.IEEE Transactions on Network Science and Engineering, 13(1):7448–7464, Mar. 2026
2026
-
[29]
Personal- ized federated sensing for heterogeneous environment
Heewon Park, Miru Kim, and Minhae Kwon. Personal- ized federated sensing for heterogeneous environment. IEEE Sensors Letters, 9(4):1–4, 2025
2025
-
[30]
Editable scene simulation for autonomous driving via collaborative LLM-agents
Yuxi Wei, Zi Wang, Yifan Lu, Chenxin Xu, Changx- ing Liu, Hao Zhao, Siheng Chen, and Yanfeng Wang. Editable scene simulation for autonomous driving via collaborative LLM-agents. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
2024
-
[31]
DriVLMe: Enhancing LLM- based autonomous driving agents with embodied and social experiences
Yidong Huang, Jacob Sansom, Ziqiao Ma, Felix Gervits, and Joyce Chai. DriVLMe: Enhancing LLM- based autonomous driving agents with embodied and social experiences. InIEEE/RSJ International Confer- ence on Intelligent Robots and Systems (IROS), 2024
2024
-
[32]
SToRM: Supervised token reduction for multi-modal LLMs toward efficient end- to-end autonomous driving
Seo Hyun Kim, Jin Bok Park, Do Yeon Koo, Hogun Park, and Il Yong Chun. SToRM: Supervised token reduction for multi-modal LLMs toward efficient end- to-end autonomous driving. InIEEE International Conference on Robotics and Automation (ICRA), 2026
2026
-
[33]
ASAP: Unsupervised post-training with label distribution shift adaptive learning rate
Heewon Park, Mugon Joe, Miru Kim, and Minhae Kwon. ASAP: Unsupervised post-training with label distribution shift adaptive learning rate. InACM Inter- national Conference on Information and Knowledge Management (CIKM), 2025
2025
-
[34]
Reflexion: Lan- guage agents with verbal reinforcement learning
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Lan- guage agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023
2023
-
[35]
Make your LLM fully utilize the context
Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian Lou, and Weizhu Chen. Make your LLM fully utilize the context. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. 11 Multi2: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments
2024
-
[36]
Toward self- improvement of LLMs via imagination, searching, and criticizing
Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. Toward self- improvement of LLMs via imagination, searching, and criticizing. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), 2024
2024
-
[37]
The lighthouse of language: Enhancing LLM agents via critique-guided improvement
Ruihan Yang, Fanghua Ye, Jian Li, Siyu Yuan, Yikai Zhang, Zhaopeng Tu, Xiaolong Li, and Deqing Yang. The lighthouse of language: Enhancing LLM agents via critique-guided improvement. InAdvances in Neu- ral Information Processing Systems (NeurIPS), 2025
2025
-
[38]
The alignment problem from a deep learning perspec- tive
Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspec- tive. InInternational Conference on Learning Repre- sentations (ICLR), 2024
2024
-
[39]
ArCHer: Training language model agents via hierarchical multi-turn RL
Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, and Aviral Kumar. ArCHer: Training language model agents via hierarchical multi-turn RL. InInternational Conference on Machine Learning (ICML), 2024
2024
-
[40]
Robust hierar- chical anomaly detection using feature impact in iot networks.ICT Express, 11(2):358–363, Apr
Joohong Rheey and Hyunggon Park. Robust hierar- chical anomaly detection using feature impact in iot networks.ICT Express, 11(2):358–363, Apr. 2025
2025
-
[41]
Chak Shek and Pratap Tokekar. Option discovery us- ing LLM-guided semantic hierarchical reinforcement learning.arXiv preprint arXiv:2503.19007, 2025
-
[42]
Leveraging imitation learning and LLMs for efficient hierarchical reinforcement learning
Runhan Yang, Jieao Shi, Mengqi Su, and Don- gruo Zhou. Leveraging imitation learning and LLMs for efficient hierarchical reinforcement learning. https://openreview.net/forum?id=6y00rooi7i, 2025
2025
-
[43]
Getting more juice out of the SFT data: Reward learning from human demonstration improves SFT for LLM alignment
Jiaxiang Li, Siliang Zeng, Hoi Wai, Chenliang Li, Alfredo Garcia, and Mingyi Hong. Getting more juice out of the SFT data: Reward learning from human demonstration improves SFT for LLM alignment. In Advances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[44]
Large lan- guage models as generalizable policies for embodied tasks
Andrew Szot, Max Schwarzer, Harsh Agrawal, Bog- dan Mazoure, Rin Metcalf, Walter Talbott, Natalie Mackraz, R Hjelm, and Alexander Toshev. Large lan- guage models as generalizable policies for embodied tasks. InInternational Conference on Learning Repre- sentations (ICLR), 2024
2024
-
[45]
Unlocking LLMs’ self-improvement capacity with autonomous learning for domain adaptation
Ke Ji, Junying Chen, Anningzhe Gao, Wenya Xie, Xiang Wan, and Benyou Wang. Unlocking LLMs’ self-improvement capacity with autonomous learning for domain adaptation. InFindings of the Association for Computational Linguistics (ACL), 2025
2025
-
[46]
Data mix- ing optimization for supervised fine-tuning of large language models
Yuan Li, Zhengzhong Liu, and Eric Xing. Data mix- ing optimization for supervised fine-tuning of large language models. InInternational Conference on Ma- chine Learning (ICML), 2025
2025
-
[47]
Wei Lu, Rachel Luu, and Markus Buehler. Fine-tuning large language models for domain adaptation: Explo- ration of training strategies, scaling, model merging and synergistic capabilities.npj Computational Mate- rials, 11(1):84, 2025
2025
-
[48]
Coevolving with the other you: Fine-tuning LLM with sequential coopera- tive multi-agent reinforcement learning
Hao Ma, Tianyi Hu, Zhiqiang Pu, Boyin Liu, Xiaolin Ai, Yanyan Liang, and Min Chen. Coevolving with the other you: Fine-tuning LLM with sequential coopera- tive multi-agent reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[49]
Instant inverse mod- eling of stochastic driving behavior with deep rein- forcement learning.IEEE Transactions on Consumer Electronics, 71(1):2152–2162, Feb
Dongsu Lee and Minhae Kwon. Instant inverse mod- eling of stochastic driving behavior with deep rein- forcement learning.IEEE Transactions on Consumer Electronics, 71(1):2152–2162, Feb. 2025
2025
-
[50]
Control- ling large language model with latent action
Chengxing Jia, Ziniu Li, Pengyuan Wang, Yi Li, Zhenyu Hou, Yuxiao Dong, and Yang Yu. Control- ling large language model with latent action. InIn- ternational Conference on Machine Learning (ICML), 2025
2025
-
[51]
Stability analysis in mixed-autonomous traffic with deep reinforcement learning.IEEE Transactions on Vehicular Technology, 72(3):2848–2862, Mar
Dongsu Lee and Minhae Kwon. Stability analysis in mixed-autonomous traffic with deep reinforcement learning.IEEE Transactions on Vehicular Technology, 72(3):2848–2862, Mar. 2023
2023
-
[52]
QLASS: Boost- ing language agent inference via Q-guided stepwise search
Zongyu Lin, Yao Tang, Xingcheng Yao, Da Yin, Ziniu Hu, Yizhou Sun, and Kai Chang. QLASS: Boost- ing language agent inference via Q-guided stepwise search. InInternational Conference on Machine Learn- ing (ICML), 2025
2025
-
[53]
Temporal distance- aware transition augmentation for offline model-based reinforcement learning
Dongsu Lee and Minhae Kwon. Temporal distance- aware transition augmentation for offline model-based reinforcement learning. InInternational Conference on Machine Learning (ICML), 2025
2025
-
[54]
Online reinforcement learning in stochastic games
Chen Wei, Yi Hong, and Chi Lu. Online reinforcement learning in stochastic games. InAdvances in Neural Information Processing Systems (NeurIPS), 2017
2017
-
[55]
Efficient online reinforcement learning with offline data
Philip Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning (ICML), 2023
2023
-
[56]
Leveraging offline data in online reinforcement learning
Andrew Wagenmaker and Aldo Pacchiano. Leveraging offline data in online reinforcement learning. InIn- ternational Conference on Machine Learning (ICML), 2023
2023
-
[57]
Continuous control with deep reinforcement learning
Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. InInternational Conference on Learning Representations (ICLR), 2016. 12 Multi2: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments
2016
-
[58]
Foresighted decisions for inter-vehicle interactions: An offline reinforcement learning approach
Dongsu Lee and Minhae Kwon. Foresighted decisions for inter-vehicle interactions: An offline reinforcement learning approach. InIEEE International Conference on Intelligent Transportation Systems (ITSC), 2023
2023
-
[59]
Ad- dressing function approximation error in actor-critic methods
Scott Fujimoto, Herke Hoof, and David Meger. Ad- dressing function approximation error in actor-critic methods. InInternational Conference on Machine Learning (ICML), 2018
2018
-
[60]
Selec- tive imitation for efficient online reinforcement learn- ing with pre-collected data.ICT Express, 10(6):1308– 1314, Dec
Chanin Eom, Dongsu Lee, and Minhae Kwon. Selec- tive imitation for efficient online reinforcement learn- ing with pre-collected data.ICT Express, 10(6):1308– 1314, Dec. 2024
2024
-
[61]
Of- fline reinforcement learning with implicit Q-learning
Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Of- fline reinforcement learning with implicit Q-learning. InInternational Conference on Learning Representa- tions (ICLR), 2022
2022
-
[62]
Price of the au- tonomous strategy with reinforcement learning in mixed-autonomy traffic networks.IEEE Transactions on Intelligent Transportation Systems, 27(2):2741– 2752, Feb
Chanin Eom and Minhae Kwon. Price of the au- tonomous strategy with reinforcement learning in mixed-autonomy traffic networks.IEEE Transactions on Intelligent Transportation Systems, 27(2):2741– 2752, Feb. 2026
2026
-
[63]
The impact of dataset on offline reinforcement learning performance in uav-based emergency network recov- ery tasks.IEEE Communications Letters, 28(5):1058– 1061, May
Jeyeon Eo, Dongsu Lee, and Minhae Kwon. The impact of dataset on offline reinforcement learning performance in uav-based emergency network recov- ery tasks.IEEE Communications Letters, 28(5):1058– 1061, May. 2024
2024
-
[64]
Curriculum reinforcement learning for cohesive team in mobile ad hoc networks.IEEE Communications Letters, 26(8):1809–1813, Aug
Nayoung Kim, Minhae Kwon, and Hyunggon Park. Curriculum reinforcement learning for cohesive team in mobile ad hoc networks.IEEE Communications Letters, 26(8):1809–1813, Aug. 2022
2022
-
[65]
AD4RL: Autonomous driving benchmarks for offline reinforcement learning with value-based dataset
Dongsu Lee, Chanin Eom, and Minhae Kwon. AD4RL: Autonomous driving benchmarks for offline reinforcement learning with value-based dataset. In IEEE International Conference on Robotics and Au- tomation (ICRA), 2024
2024
-
[66]
Episodic future thinking with offline reinforcement learning for au- tonomous driving.IEEE Internet of Things Journal, 12(11):17012–17023, Jun
Dongsu Lee and Minhae Kwon. Episodic future thinking with offline reinforcement learning for au- tonomous driving.IEEE Internet of Things Journal, 12(11):17012–17023, Jun. 2025
2025
-
[67]
A unified principle of pessimism for offline reinforce- ment learning under model mismatch
Yue Wang, Zhongchang Sun, and Shaofeng Zou. A unified principle of pessimism for offline reinforce- ment learning under model mismatch. InAdvances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[68]
Is value learning really the main bottleneck in offline RL? InAdvances in Neural Information Processing Systems (NeurIPS), 2024
Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline RL? InAdvances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[69]
Beyond online sampling: Bridging offline- to-online alignment via dynamic data transformation for LLMs
Zhang Zhang, Guhao Feng, Jian Guan, Di He, and Wei Wu. Beyond online sampling: Bridging offline- to-online alignment via dynamic data transformation for LLMs. InEmpirical Methods in Natural Language Processing (EMNLP), 2025
2025
-
[70]
DigiRL: Training in-the-wild device-control agents with au- tonomous reinforcement learning
Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. DigiRL: Training in-the-wild device-control agents with au- tonomous reinforcement learning. InAdvances in Neu- ral Information Processing Systems (NeurIPS), 2024
2024
-
[71]
Unpacking DPO and PPO: Disentangling best practices for learning from preference feedback
Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah Smith, Yejin Choi, and Hannaneh Hajishirzi. Unpacking DPO and PPO: Disentangling best practices for learning from preference feedback. InAdvances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[72]
Bridging offline and online reinforcement learning for llms.arXiv preprint arXiv:2506.21495, 2025
Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason Weston, Sainbayar Sukhbaatar, and Ilia Kulikov. Bridging offline and online re- inforcement learning for LLMs.arXiv preprint arXiv:2506.21495, 2025
-
[73]
Scenario-free au- tonomous driving with multi-task offline-to-online re- inforcement learning.IEEE Transactions on Intelli- gent Transportation Systems, 26(9):13317–13330, Sep
Dongsu Lee and Minhae Kwon. Scenario-free au- tonomous driving with multi-task offline-to-online re- inforcement learning.IEEE Transactions on Intelli- gent Transportation Systems, 26(9):13317–13330, Sep. 2025
2025
-
[74]
Test-time fine-tuning of image compression models for multi- task adaptability
Unki Park, Seongmoon Jeong, Youngchan Jang, Gyeong-Moon Park, and Jong Hwan Ko. Test-time fine-tuning of image compression models for multi- task adaptability. InIEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2025
2025
-
[75]
KL-regularized reinforce- ment learning is designed to mode collapse
Anthony Chen, Jatin Prakash, Jeff Guo, Rob Fergus, and Rajesh Ranganath. KL-regularized reinforce- ment learning is designed to mode collapse. InIn- ternational Conference on Learning Representations (ICLR), 2026
2026
-
[76]
KL-regularised Q-learning: A token- level action-value perspective on online RLHF
Jason Brown, Lennie Wells, Edward Young, and Ser- gio Bacallado. KL-regularised Q-learning: A token- level action-value perspective on online RLHF. In ICML Workshop on Models of Human Feedback for AI Alignment, 2025
2025
-
[77]
The choice of diver- gence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward
Long Li, Jiaran Hao, Jason Liu, Zhijian Zhou, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, and Yuan Qi. The choice of diver- gence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward. In International Conference on Learning Representations (ICLR), 2026. 13 Multi2: Hierarchical M...
2026
-
[78]
ALF- World: Aligning text and embodied environments for interactive learning
Mohit Shridhar, Xingdi Yuan, Marc Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALF- World: Aligning text and embodied environments for interactive learning. InInternational Conference on Learning Representations (ICLR), 2021
2021
-
[79]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Ke- qin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[80]
Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Chaplot, Diego Casas, Flo- rian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Lavaud, Marie Lachaux, Pierre Stock, Teven Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William Sayed. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.