How to Steer Your Multi-Agent System: Human-LLM Collaborative Planning

Dan Zhang; Estevam Hruschka; Hannah Kim; Zeyu He

arxiv: 2605.23023 · v1 · pith:XOFZUV2Znew · submitted 2026-05-21 · 💻 cs.MA · cs.HC

Pith reviewed 2026-05-25 05:02 UTC · model grok-4.3

keywords human-LLM collaboration · multi-agent systems · planning · co-planning · design space · user study · process-level supervision

The pith

A three-axis design space lets humans steer multi-agent LLM plans with process-level semantic and structural edits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes a design space for human-LLM co-planning to address the difficulty humans face managing complex plans in multi-agent systems due to limited transparency. It defines interactions along mode (semantic versus structural), scope (global versus targeted), and level (low versus high), and implements the space in the AMBIPOM prototype for process-level supervision. A user study reveals how people combine these options into hybrid workflows with effort-control-risk trade-offs, while a benchmark examines how LLMs respond to different revision scopes and strategies. The work produces design insights aimed at making human-AI planning more transparent and controllable than outcome-only supervision allows.

Core claim

We formalize a design space for human-LLM co-planning interactions along three axes: mode (semantic vs. structural), scope (global vs. targeted), and level (low vs. high-level edits). We realize it in AMBIPOM, a prototype supporting process-level supervision through both semantic and structural interactions. Through a user study, we characterize how users navigate this space, revealing hybrid workflows and effort-control-risk trade-offs; through a controlled benchmark, we analyze how LLMs revise plans under varying scope and revision strategies. Our findings yield design insights for more transparent, controllable, and effective human-AI co-planning.
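The three axes can be pictured as a small product type. The sketch below is illustrative only: the class names, the `Interaction` record, and the `design_space` helper are hypothetical and do not come from the AMBIPOM code base, and the source does not say whether every combination of axis values is supported in the prototype.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Mode(Enum):
    SEMANTIC = "semantic"
    STRUCTURAL = "structural"


class Scope(Enum):
    GLOBAL = "global"
    TARGETED = "targeted"


class Level(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class Interaction:
    """One point in the design space (hypothetical record, not the paper's API)."""
    mode: Mode
    scope: Scope
    level: Level

    def label(self) -> str:
        return f"{self.mode.value}/{self.scope.value}/{self.level.value}"


def design_space() -> list[Interaction]:
    """Enumerate every combination of the three binary axes."""
    return [Interaction(m, s, l) for m, s, l in product(Mode, Scope, Level)]
```

Calling `design_space()` yields eight `Interaction` values, one per combination of the three binary axes. This enumeration describes the space itself, not the workflows users actually followed in the study.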

What carries the argument

The three-axis design space (mode, scope, level) for human-LLM co-planning interactions, realized in the AMBIPOM prototype to enable process-level supervision.

If this is right

Users naturally combine semantic and structural edits at varying scopes and levels into hybrid workflows.
Different choices along the three axes produce measurable trade-offs among effort, control gained, and revision risk.
LLMs exhibit distinct revision patterns when changes are global versus targeted or when revision strategies differ.
Process-level supervision through the design space improves transparency compared with outcome-only checks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The axes could be tested in domains outside the current benchmark to check whether additional dimensions emerge.
Systems built on this space might surface suggested axis combinations to reduce user search effort.
The benchmark revision patterns could guide default LLM behaviors when no human edit is supplied.

Load-bearing premise

The three axes of mode, scope, and level sufficiently capture the key dimensions of human-LLM co-planning interactions and the user study plus benchmark results generalize beyond the specific prototype and participant pool.

What would settle it

A larger user study or different multi-agent task in which participants show no hybrid workflows and no measurable gain in perceived control or transparency over outcome-level supervision would falsify the utility of the proposed design space.

Figures

Figures reproduced from arXiv: 2605.23023 by Dan Zhang, Estevam Hruschka, Hannah Kim, Zeyu He.

**Figure 1.** AMBIPOM supports transparent and controllable human–LLM co-planning through a dual-panel interface. (A) Chat Panel supports plan generation, replanning, and execution feedback, with textualized logs of plan changes for transparency. (B) Plan Panel visualizes the current plan as an editable graph, allowing users to inspect and refine the workflow via DMs. where selection denotes either a subgraph for an aut…

**Figure 2.** Each node card shows the agent, task, status, ed…
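The captions describe the plan as an editable graph whose node cards carry an agent, a task, and a status, with plan changes logged as text. A minimal sketch of such a structure follows, assuming a simple directed graph. The `Node` and `Plan` classes, the agent names, and the reconnect-on-removal policy are all hypothetical and are not taken from AMBIPOM.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    agent: str              # agent assigned to this step
    task: str               # instruction text shown on the node card
    status: str = "pending"


@dataclass
class Plan:
    nodes: dict[int, Node] = field(default_factory=dict)
    edges: set[tuple[int, int]] = field(default_factory=set)  # (source, target)
    log: list[str] = field(default_factory=list)              # textual change log

    def add(self, node: Node, after: tuple[int, ...] = ()) -> None:
        """Add a node and connect it to its prerequisite nodes."""
        for src in after:
            if src not in self.nodes:
                raise KeyError(f"unknown prerequisite {src}")
        self.nodes[node.id] = node
        self.edges |= {(src, node.id) for src in after}

    def remove(self, node_id: int) -> str:
        """Remove one node and reconnect its predecessors to its successors.

        Reconnecting is an assumed policy chosen for this sketch.
        """
        preds = [s for s, d in self.edges if d == node_id]
        succs = [d for s, d in self.edges if s == node_id]
        self.edges = {e for e in self.edges if node_id not in e}
        self.edges |= {(p, s) for p in preds for s in succs}
        removed = self.nodes.pop(node_id)
        message = f"removed node {node_id} ({removed.agent}): {removed.task}"
        self.log.append(message)
        return message


def example_plan() -> Plan:
    """Three-step linear plan with made-up agents and tasks."""
    plan = Plan()
    plan.add(Node(1, "search_agent", "Find recent sales figures"))
    plan.add(Node(2, "cleaning_agent", "Normalize the figures"), after=(1,))
    plan.add(Node(3, "report_agent", "Write a summary"), after=(2,))
    return plan
```

Removing node 2 from `example_plan()` leaves a single edge from node 1 to node 3 and appends one entry to `plan.log`. That is the kind of change a user could inspect step by step, which a check on the final output alone would not show.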

Original abstract

In orchestrated multi-agent systems, humans often struggle to manage plans due to their complexity and limited transparency. Existing approaches rely on outcome-level supervision, where users verify only final outputs without visibility into intermediate reasoning. We formalize a design space for human-LLM co-planning interactions along three axes: mode (semantic vs. structural), scope (global vs. targeted), and level (low vs. high-level edits). We realize it in AMBIPOM, a prototype supporting process-level supervision through both semantic and structural interactions. Through a user study, we characterize how users navigate this space, revealing hybrid workflows and effort-control-risk trade-offs; through a controlled benchmark, we analyze how LLMs revise plans under varying scope and revision strategies. Our findings yield design insights for more transparent, controllable, and effective human-AI co-planning. We release code and data at https://github.com/megagonlabs/ambipom.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a usable three-axis design space for process-level human oversight of LLM multi-agent plans, realized in AMBIPOM with a user study and benchmark that surface practical trade-offs.

The main takeaway is that the authors formalize human-LLM co-planning along mode (semantic vs. structural), scope (global vs. targeted), and level (low vs. high-level edits), then build AMBIPOM to support it and test the idea with a user study plus a controlled benchmark on LLM revisions. They release the code and data, which is straightforward and helpful. This moves the conversation from just checking final outputs to giving humans visibility and control over the intermediate steps, which matches a real issue in these systems. The hybrid workflows and effort-control-risk observations are the kind of grounded insights that could actually inform tool design. The three axes look like a reasonable organizing lens for this subfield without pretending to be exhaustive. The work sits cleanly in the HCI and multi-agent systems literature and does not overclaim.

The soft spots are the usual ones for this genre: the abstract gives no sample sizes, statistical details, or exact benchmark setup, so the strength of the findings is hard to judge without the full methods and results sections. Generalization beyond the prototype and participant pool is limited by design, and the axes are presented as useful rather than proven complete. That said, nothing in the description suggests circularity or hidden fitting.

This paper is for people working on human oversight of multi-agent LLM systems who need concrete design options and trade-off data. A reader in HCI or applied multi-agent work would get direct value from the framework and the released resources. It deserves a serious referee because the contribution is clearly scoped, the resources are open, and the empirical pieces are present even if they need more detail in review.

Referee Report

0 major / 3 minor

Summary. The paper formalizes a three-axis design space for human-LLM co-planning interactions in multi-agent systems (mode: semantic vs. structural; scope: global vs. targeted; level: low vs. high-level edits), realizes the space in the AMBIPOM prototype for process-level supervision, and reports a user study on hybrid workflows and effort-control-risk trade-offs plus a controlled benchmark on LLM plan revisions under varying scopes and strategies. It concludes with design insights for transparent and controllable human-AI co-planning and releases code and data.

Significance. If the empirical findings hold, the work supplies a practical organizing lens for human-AI planning interfaces that moves beyond outcome-only supervision. The public release of code and data is a clear strength that supports reproducibility and follow-on work in the multi-agent systems and HCI communities.

minor comments (3)

Abstract: the description of the user study and benchmark omits sample size, statistical methods, and headline quantitative or qualitative results; adding one sentence summarizing these would improve completeness without lengthening the abstract unduly.
The three axes are presented as a useful design space rather than a provably exhaustive taxonomy; a brief discussion of potential additional dimensions (e.g., temporal or multi-user aspects) would help readers assess scope.
The benchmark section would benefit from an explicit statement of the evaluation metrics and baseline strategies used for the LLM revision comparisons.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary; for recognizing the design space, the AMBIPOM prototype, the user study, the benchmark, and the code and data release; and for the minor-revision recommendation. No major comments appear in the provided report.

Circularity Check

0 steps flagged

No significant circularity detected

This is an HCI/systems design paper that formalizes a three-axis design space for human-LLM co-planning, realizes it in the AMBIPOM prototype, and evaluates it via user study and controlled benchmark. No equations, fitted parameters, predictions, or derivations appear; the axes are presented as an organizing lens rather than a result derived from self-citation chains or self-definitional inputs. The central claims rest on empirical observations from the study and benchmark, which are independent of the formalization itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper introduces a new design space and prototype based on standard HCI evaluation methods; no free parameters, mathematical axioms, or invented physical entities are evident from the abstract.

invented entities (1)

AMBIPOM (no independent evidence)
purpose: Prototype system supporting semantic and structural interactions for process-level supervision in multi-agent planning
New system introduced to realize the proposed design space.

[1] [1]

Ai-Chang, J

M. Ai-Chang, J. Bresina, L. Charest, A. Chase, J.C.-J. Hsu, A. Jonsson, B. Kanefsky, P. Morris, Kanna Rajan, J. Yglesias, B.G. Chafin, W.C. Dias, and P.F. Maldague. 2004. MAPGEN: mixed-initiative planning and scheduling for the Mars Exploration Rover mission.IEEE Intelligent Systems19, 1 (2004), 8–12. doi:10.1109/MIS.2004. 1265878

work page doi:10.1109/mis.2004 2004

[2] [2]

Amine Barrak. 2025. Traceability and Accountability in Role-Specialized Multi- Agent LLM Pipelines. In2025 40th IEEE/ACM International Conference on Auto- mated Software Engineering Workshops (ASEW). IEEE, 315–322

work page 2025

[3] [3]

Case, Amanda, and Tianyi Zhang

Wei-Hao Chen, Weixi Tong, Ph.D. Case, Amanda, and Tianyi Zhang. 2025. Dango: A Mixed-Initiative Data Wrangling System using Large Language Model. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Machinery, New York, NY, USA, Article 389, 28 pages. doi:10.1145/3706598.3714135

work page doi:10.1145/3706598.3714135 2025

[4] [4]

Yongchao Chen, Jacob Arkin, Yang Zhang, Nicholas Roy, and Chuchu Fan. 2024. Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems?. In2024 IEEE International Conference on Robotics and Automation (ICRA). 4311–4317. doi:10.1109/ICRA57147.2024.10610676 ACM CAIS ’26, May 26–29, 2026, San Jose, CA, USA He et al

work page doi:10.1109/icra57147.2024.10610676 2024

[5] [5]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems.arXiv preprint arXiv:2110.14168(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [6]

Will Epperson, Gagan Bansal, Victor C Dibia, Adam Fourney, Jack Gerrits, Erkang (Eric) Zhu, and Saleema Amershi. 2025. Interactive Debugging and Steering of Multi-Agent AI Systems. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Ma- chinery, New York, NY, USA, Article 156, 15 pages. doi:1...

work page doi:10.1145/3706598.3713581 2025

[7] [7]

K. J. Kevin Feng, David W. McDonald, and Amy X. Zhang. 2025. Levels of Autonomy for AI Agents. arXiv:2506.12469 [cs.HC] https://arxiv.org/abs/2506. 12469

work page arXiv 2025

[8] [8]

K. J. Kevin Feng, Kevin Pu, Matt Latzke, Tal August, Pao Siangliulue, Jonathan Bragg, Daniel S Weld, Amy X. Zhang, and Joseph Chee Chang. 2026. Cocoa: Co-Planning and Co-Execution with AI Agents. InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI ’26). Association for Computing Machinery, New York, NY, USA, Article 16, 23 ...

work page arXiv 2026

[9] [9]

Freund, Brooke Simon, Emery D

Stephen N. Freund, Brooke Simon, Emery D. Berger, and Eunice Jun. 2025. Flowco: Mixed-Initiative Authoring of Reliable End-to-End Data Analyses via Dataflow Graphs and LLMs. InProceedings of the 38th Annual ACM Symposium on User In- terface Software and Technology (UIST ’25). Association for Computing Machinery, New York, NY, USA, Article 182, 20 pages. d...

work page doi:10.1145/3746059.3747636 2025

[10] [10]

Chawla, Olaf Wiest, and Xiangliang Zhang

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large Language Model Based Multi-agents: A Survey of Progress and Challenges. InProceedings of the Thirty- Third International Joint Conference on Artificial Intelligence, IJCAI-24, Kate Larson (Ed.). International Joint Conferences o...

work page doi:10.24963/ijcai.2024/890 2024

[11] [11]

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber

work page

[12] [12]

InThe Twelfth International Conference on Learning Representations

MetaGPT: Meta Programming for A Multi-Agent Collaborative Frame- work. InThe Twelfth International Conference on Learning Representations. https: //openreview.net/forum?id=VtmBAGCN7o

work page

[13] [13]

Eric Horvitz. 1999. Principles of mixed-initiative user interfaces. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems(Pittsburgh, Pennsylvania, USA)(CHI ’99). Association for Computing Machinery, New York, NY, USA, 159–166. doi:10.1145/302979.303030

work page doi:10.1145/302979.303030 1999

[14] [14]

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.ACM Trans. Inf. Syst.43, 2, Article 42 (Jan. 2025), 55 pages. doi:10.1145/3703155

work page doi:10.1145/3703155 2025

[15] Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. 2024. Understanding the planning of LLM agents: A survey. arXiv:2402.02716 [cs.AI] https://arxiv.org/abs/2402.02716

[16] Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. 2024. Position: LLMs can’t plan, but can help planning in LLM-modulo frameworks. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML ’24). JMLR.org, Article 921, 13 pages.

[17] Hannah Kim, Kushan Mitra, Chen Shen, Dan Zhang, and Estevam Hruschka. 2025. AIPOM: Agent-aware Interactive Planning for Multi-Agent Systems. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Ivan Habernal, Peter Schulam, and Jörg Tiedemann (Eds.). Association for Computational Linguistics, Suzhou, China, 85–96. doi:10.18653/v1/2025.emnlp-demos.7

[19] Joongwon Kim, Bhargavi Paranjape, Tushar Khot, and Hannaneh Hajishirzi. Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning. arXiv:2406. [cs.CL]

[21] Boyi Li, Philipp Wu, Pieter Abbeel, and Jitendra Malik. 2025. Interactive Task Planning with Language Models. Transactions on Machine Learning Research (2025). https://openreview.net/forum?id=VmfWywWuYQ

[22] Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang. 2024. A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth 1, 1 (2024), 9.

[23] Anthony Zhe Liu, Xinhe Wang, Jacob Sansom, Yao Fu, Jongwook Choi, Sungryull Sohn, Jaekyeom Kim, and Honglak Lee. 2025. Interactive and Expressive Code-Augmented Planning with Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shu...

doi:10.18653/v1/2025.acl-long.994

[24] Shuodi Liu, Yingzhuo Liu, Zi Wang, Yusheng Wang, Huijia Wu, Liuyu Xiang, and Zhaofeng He. 2025. Select-Then-Decompose: From Empirical Analysis to Adaptive Selection Strategy for Task Decomposition in Large Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Christos Christodoulopoulos, Tanmoy Chakrabo...

doi:10.18653/v1/2025.emnlp-main.278

[25] Damien Masson, Sylvain Malacria, Géry Casiez, and Daniel Vogel. 2024. DirectGPT: A Direct Manipulation Interface to Interact with Large Language Models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 975, 16 pages. doi:10.1145/36...

doi:10.1145/3613904.3642462

[26] David J. Moore. 2025. A Taxonomy of Hierarchical Multi-Agent Systems: Design Patterns, Coordination Mechanisms, and Industrial Applications. arXiv:2508.12683 [cs.MA] https://arxiv.org/abs/2508.12683

[27] Hussein Mozannar, Gagan Bansal, Cheng Tan, Adam Fourney, Victor Dibia, Jingya Chen, Jack Gerrits, Tyler Payne, Matheus Kunzler Maldaner, Madeleine Grunde-McLaughlin, Eric Zhu, Griffin Bassman, Jacob Alber, Peter Chang, Ricky Loynd, Friederike Niedtner, Ece Kamar, Maya Murad, Rafah Hosn, and Saleema Amershi. 2025. Magentic-UI: Towards Human-in-the-loop Age...

[28] Justin Reppert, Ben Rachbach, Charlie George, Luke Stebbing, Jungwon Byun, Maggie Appleton, and Andreas Stuhlmüller. 2023. Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes. arXiv:2301.01751 [cs.CL] https://arxiv.org/abs/2301.01751

[29] Yijia Shao, Vinay Samuel, Yucheng Jiang, John Yang, and Diyi Yang. 2026. Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=GDYueXtKXT

[30] Lijun Sun, Yijun Yang, Qiqi Duan, Yuhui Shi, Chao Lyu, Yu-Cheng Chang, Chin-Teng Lin, and Yang Shen. 2025. Multi-Agent Coordination across Diverse Applications: A Survey. arXiv:2502.14743 [cs.MA] https://arxiv.org/abs/2502.14743

[31] Yashar Talebirad and Amirhossein Nadiri. 2023. Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents. arXiv:2306.03314 [cs.AI] https://arxiv.org/abs/2306.03314

[32] Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D. Nguyen. 2025. Multi-Agent Collaboration Mechanisms: A Survey of LLMs. arXiv:2501.06322 [cs.AI] https://arxiv.org/abs/2501.06322

[33] Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. 2023. On the planning abilities of large language models: a critical investigation. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 3320, 13 pages.

[34] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science 18, 6 (2024), 186345.

[35] Zexin Wang, Jingjing Li, Quan Zhou, Haotian Si, Yuanhao Liu, Jianhui Li, Gaogang Xie, Fei Sun, Dan Pei, and Changhua Pei. 2025. A Survey on AgentOps: Categorization, Challenges, and Future Directions. arXiv preprint arXiv:2508.02121 (2025).

[36] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY...

[37] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang (Eric) Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Ahmed Awadallah, Ryen W. White, Doug Burger, and Chi Wang. 2024. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. In COLM 2024. https://www.microsoft.com/en-us/research/publication/autogen-enabling-next-gen-llm-a...

[38] Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, Yangning Li, Dongyuan Li, Renhe Jiang, Xue Liu, and Philip S Yu. 2025. A survey on large language model based human-agent systems. Authorea Preprints (2025).

How to Steer Your Multi-Agent System: Human-LLM Collaborative P...

Break the problem into independent, atomic nodes.

Each node is an INSTRUCTION only: describe what must be done, not the result.
- You may include constants ONLY if they appear explicitly in the original problem statement.
- Do not invent, look up, or leak unknown values into the plan; such values must be produced by earlier nodes or via [search].
- Do NOT mention any other nodes in the task description.
-...

A single agent must be able to complete each node using ONLY:
- the node's instruction,
- the specified agent, and
- outputs from its prereqs.

Do NOT reference "the original question" inside nodes. Rewrite what's needed directly into each node's instruction.

Use exactly one agent per node in the "agent_name" field. If multiple agents seem required, split the node.

Include any necessary variable names directly in the instruction so the executing agent has everything it needs. Use snake_case for output variable names.

Produce a valid DAG:
- No isolated nodes.
- A single sink node (the node with the highest id) is the final output node.
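The DAG rules above can be checked mechanically on a generated plan. The following is a minimal Python sketch, not the paper's implementation: the function name `check_plan_dag`, the integer node ids, and the `(source, target)` edge pairs are illustrative assumptions about how a plan might be represented. Two further assumptions are that acyclicity is verified explicitly (the prompt only says "valid DAG") and that a plan with a single node does not count as having an isolated node.

```python
from collections import defaultdict, deque


def check_plan_dag(node_ids, edges):
    """Return a list of rule violations for a plan graph (empty list = valid).

    node_ids: iterable of integer node ids (hypothetical representation).
    edges: iterable of (source_id, target_id) pairs (hypothetical representation).
    """
    node_ids = set(node_ids)
    problems = []
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    successors = defaultdict(list)

    for src, dst in edges:
        if src not in node_ids or dst not in node_ids:
            problems.append(f"edge ({src}, {dst}) references an unknown node")
            continue
        out_deg[src] += 1
        in_deg[dst] += 1
        successors[src].append(dst)

    # Rule: no isolated nodes (assumption: a one-node plan is exempt).
    if len(node_ids) > 1:
        for n in sorted(node_ids):
            if in_deg[n] == 0 and out_deg[n] == 0:
                problems.append(f"node {n} is isolated")

    # Rule: exactly one sink, and it must carry the highest id.
    sinks = sorted(n for n in node_ids if out_deg[n] == 0)
    if len(sinks) != 1:
        problems.append(f"expected exactly one sink node, found {sinks}")
    elif sinks[0] != max(node_ids):
        problems.append(
            f"sink node {sinks[0]} is not the highest id ({max(node_ids)})"
        )

    # Acyclicity via Kahn's algorithm: every node must be reachable
    # by repeatedly removing nodes with no remaining incoming edges.
    remaining = {n: in_deg[n] for n in node_ids}
    queue = deque(n for n, d in remaining.items() if d == 0)
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for m in successors[n]:
            remaining[m] -= 1
            if remaining[m] == 0:
                queue.append(m)
    if visited != len(node_ids):
        problems.append("plan graph contains a cycle")

    return problems
```

A check of this kind could run on every LLM-generated plan before it is shown to the user, so that structural violations are caught independently of how closely the model followed the prompt.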

Edges:
- Only create edges for actual data dependencies (where a later node's input name matches a prior node's output variable name).
- Every edge must point from an existing output to a named input expected by the destination node.

<given task>

Prompt: Replanning

<same system prompt as plan generation>

A plan and user feedback are given to you. Your job...

A selected sub-graph (a set of nodes and connecting edges) as the focus for replanning. Your goal is to regenerate ONLY the selected sub-graph nodes, while keeping the interface (inputs/outputs defined by edges connecting to outside nodes) fully consistent.

==================== GLOBAL INSTRUCTIONS ====================

- Every new node generated inside the...

"id": -1, // Use negative integers (-1, -2, -3, ...) for all new nodes inside the replanned sub-graph

**Boundary consistency:**
- Any variable appearing on incoming edges from outside the sub-graph must appear as an input in at least one replanned node.
- Any variable appearing on outgoing edges to outside the sub-graph must be produced as an output by at least one replanned node.
- Outside node IDs and boundary edge structures must remain exactly the same.
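The first two boundary rules reduce to set comparisons between the variables crossing the sub-graph boundary and those the replanned nodes consume or produce. A minimal Python sketch follows; the function name `check_boundary` and the node fields `"inputs"` and `"outputs"` are hypothetical, chosen only for illustration, and the third rule (unchanged outside node ids and boundary edges) is not covered here.

```python
def check_boundary(replanned_nodes, incoming_vars, outgoing_vars):
    """Return (missing_inputs, missing_outputs) for a replanned sub-graph.

    replanned_nodes: list of dicts with hypothetical "inputs"/"outputs" lists.
    incoming_vars: variable names on edges entering the sub-graph from outside.
    outgoing_vars: variable names on edges leaving the sub-graph to outside.
    """
    consumed = set()
    produced = set()
    for node in replanned_nodes:
        consumed.update(node.get("inputs", []))
        produced.update(node.get("outputs", []))

    # Incoming boundary variables must be an input of some replanned node.
    missing_inputs = sorted(set(incoming_vars) - consumed)
    # Outgoing boundary variables must be an output of some replanned node.
    missing_outputs = sorted(set(outgoing_vars) - produced)
    return missing_inputs, missing_outputs


# Example with negative ids for new nodes, as the replanning prompt requires.
example_nodes = [
    {"id": -1, "inputs": ["raw_prices"], "outputs": ["clean_prices"]},
    {"id": -2, "inputs": ["clean_prices"], "outputs": ["mean_price"]},
]
```

Both returned lists being empty indicates that the replanned sub-graph still fits the interface of the surrounding plan under these two rules.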

**Atomic instructions:**
- Each node must remain atomic, executable by exactly one agent.
- Split tasks if multiple agent types would be required.

**Self-contained tasks:**
- Node instructions must not reference other nodes or "the original question."
- Use variable names verbatim from inputs/outputs.

**Valid DAG:**
- No isolated nodes.
- Exactly one sink node inside the replanned sub-graph.

======================== RESPONSE FORMAT (JSON) ========================

{
  "nodes": [ <list of replanned node objects> ],
  "edges": [ <list of replanned edge objects> ]
}

A sub-graph plan and user feedback are given to you. Your job is to revise the subplan based on ...