How to Steer Your Multi-Agent System: Human-LLM Collaborative Planning
Pith reviewed 2026-05-25 05:02 UTC · model grok-4.3
The pith
A three-axis design space lets humans steer multi-agent LLM plans with process-level semantic and structural edits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formalize a design space for human-LLM co-planning interactions along three axes: mode (semantic vs. structural), scope (global vs. targeted), and level (low vs. high-level edits). We realize it in AMBIPOM, a prototype supporting process-level supervision through both semantic and structural interactions. Through a user study, we characterize how users navigate this space, revealing hybrid workflows and effort-control-risk trade-offs; through a controlled benchmark, we analyze how LLMs revise plans under varying scope and revision strategies. Our findings yield design insights for more transparent, controllable, and effective human-AI co-planning.
What carries the argument
The three-axis design space (mode, scope, level) for human-LLM co-planning interactions, realized in the AMBIPOM prototype to enable process-level supervision.
If this is right
- Users naturally combine semantic and structural edits at varying scopes and levels into hybrid workflows.
- Different choices along the three axes produce measurable trade-offs among effort, control gained, and revision risk.
- LLMs exhibit distinct revision patterns when changes are global versus targeted or when revision strategies differ.
- Process-level supervision through the design space improves transparency compared with outcome-only checks.
Where Pith is reading between the lines
- The axes could be tested in domains outside the current benchmark to check whether additional dimensions emerge.
- Systems built on this space might surface suggested axis combinations to reduce user search effort.
- The benchmark revision patterns could guide default LLM behaviors when no human edit is supplied.
Load-bearing premise
The three axes of mode, scope, and level sufficiently capture the key dimensions of human-LLM co-planning interactions and the user study plus benchmark results generalize beyond the specific prototype and participant pool.
What would settle it
A larger user study or different multi-agent task in which participants show no hybrid workflows and no measurable gain in perceived control or transparency over outcome-level supervision would falsify the utility of the proposed design space.
Figures
read the original abstract
In orchestrated multi-agent systems, humans often struggle to manage plans due to their complexity and limited transparency. Existing approaches rely on outcome-level supervision, where users verify only final outputs without visibility into intermediate reasoning. We formalize a design space for human-LLM co-planning interactions along three axes: mode (semantic vs. structural), scope (global vs. targeted), and level (low vs. high-level edits). We realize it in AMBIPOM, a prototype supporting process-level supervision through both semantic and structural interactions. Through a user study, we characterize how users navigate this space, revealing hybrid workflows and effort-control-risk trade-offs; through a controlled benchmark, we analyze how LLMs revise plans under varying scope and revision strategies. Our findings yield design insights for more transparent, controllable, and effective human-AI co-planning. We release code and data at https://github.com/megagonlabs/ambipom.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes a three-axis design space for human-LLM co-planning interactions in multi-agent systems (mode: semantic vs. structural; scope: global vs. targeted; level: low vs. high-level edits), realizes the space in the AMBIPOM prototype for process-level supervision, and reports a user study on hybrid workflows and effort-control-risk trade-offs plus a controlled benchmark on LLM plan revisions under varying scopes and strategies. It concludes with design insights for transparent and controllable human-AI co-planning and releases code and data.
Significance. If the empirical findings hold, the work supplies a practical organizing lens for human-AI planning interfaces that moves beyond outcome-only supervision. The public release of code and data is a clear strength that supports reproducibility and follow-on work in the multi-agent systems and HCI communities.
minor comments (3)
- Abstract: the description of the user study and benchmark omits sample size, statistical methods, and headline quantitative or qualitative results; adding one sentence summarizing these would improve completeness without lengthening the abstract unduly.
- The three axes are presented as a useful design space rather than a provably exhaustive taxonomy; a brief discussion of potential additional dimensions (e.g., temporal or multi-user aspects) would help readers assess scope.
- The benchmark section would benefit from an explicit statement of the evaluation metrics and baseline strategies used for the LLM revision comparisons.
Simulated Author's Rebuttal
We thank the referee for the positive summary, recognition of the design space, AMBIPOM prototype, user study, benchmark, and code/data release, as well as the minor_revision recommendation. No major comments appear in the provided report.
Circularity Check
No significant circularity detected
full rationale
This is an HCI/systems design paper that formalizes a three-axis design space for human-LLM co-planning, realizes it in the AMBIPOM prototype, and evaluates it via user study and controlled benchmark. No equations, fitted parameters, predictions, or derivations appear; the axes are presented as an organizing lens rather than a result derived from self-citation chains or self-definitional inputs. The central claims rest on empirical observations from the study and benchmark, which are independent of the formalization itself.
Axiom & Free-Parameter Ledger
invented entities (1)
-
AMBIPOM
no independent evidence
Reference graph
Works this paper leans on
-
[1]
M. Ai-Chang, J. Bresina, L. Charest, A. Chase, J.C.-J. Hsu, A. Jonsson, B. Kanefsky, P. Morris, Kanna Rajan, J. Yglesias, B.G. Chafin, W.C. Dias, and P.F. Maldague. 2004. MAPGEN: mixed-initiative planning and scheduling for the Mars Exploration Rover mission.IEEE Intelligent Systems19, 1 (2004), 8–12. doi:10.1109/MIS.2004. 1265878
-
[2]
Amine Barrak. 2025. Traceability and Accountability in Role-Specialized Multi- Agent LLM Pipelines. In2025 40th IEEE/ACM International Conference on Auto- mated Software Engineering Workshops (ASEW). IEEE, 315–322
work page 2025
-
[3]
Case, Amanda, and Tianyi Zhang
Wei-Hao Chen, Weixi Tong, Ph.D. Case, Amanda, and Tianyi Zhang. 2025. Dango: A Mixed-Initiative Data Wrangling System using Large Language Model. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Machinery, New York, NY, USA, Article 389, 28 pages. doi:10.1145/3706598.3714135
-
[4]
Yongchao Chen, Jacob Arkin, Yang Zhang, Nicholas Roy, and Chuchu Fan. 2024. Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems?. In2024 IEEE International Conference on Robotics and Automation (ICRA). 4311–4317. doi:10.1109/ICRA57147.2024.10610676 ACM CAIS ’26, May 26–29, 2026, San Jose, CA, USA He et al
-
[5]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems.arXiv preprint arXiv:2110.14168(2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[6]
Will Epperson, Gagan Bansal, Victor C Dibia, Adam Fourney, Jack Gerrits, Erkang (Eric) Zhu, and Saleema Amershi. 2025. Interactive Debugging and Steering of Multi-Agent AI Systems. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Ma- chinery, New York, NY, USA, Article 156, 15 pages. doi:1...
- [7]
-
[8]
K. J. Kevin Feng, Kevin Pu, Matt Latzke, Tal August, Pao Siangliulue, Jonathan Bragg, Daniel S Weld, Amy X. Zhang, and Joseph Chee Chang. 2026. Cocoa: Co-Planning and Co-Execution with AI Agents. InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI ’26). Association for Computing Machinery, New York, NY, USA, Article 16, 23 ...
-
[9]
Stephen N. Freund, Brooke Simon, Emery D. Berger, and Eunice Jun. 2025. Flowco: Mixed-Initiative Authoring of Reliable End-to-End Data Analyses via Dataflow Graphs and LLMs. InProceedings of the 38th Annual ACM Symposium on User In- terface Software and Technology (UIST ’25). Association for Computing Machinery, New York, NY, USA, Article 182, 20 pages. d...
-
[10]
Chawla, Olaf Wiest, and Xiangliang Zhang
Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large Language Model Based Multi-agents: A Survey of Progress and Challenges. InProceedings of the Thirty- Third International Joint Conference on Artificial Intelligence, IJCAI-24, Kate Larson (Ed.). International Joint Conferences o...
-
[11]
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber
-
[12]
InThe Twelfth International Conference on Learning Representations
MetaGPT: Meta Programming for A Multi-Agent Collaborative Frame- work. InThe Twelfth International Conference on Learning Representations. https: //openreview.net/forum?id=VtmBAGCN7o
-
[13]
Eric Horvitz. 1999. Principles of mixed-initiative user interfaces. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems(Pittsburgh, Pennsylvania, USA)(CHI ’99). Association for Computing Machinery, New York, NY, USA, 159–166. doi:10.1145/302979.303030
-
[14]
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.ACM Trans. Inf. Syst.43, 2, Article 42 (Jan. 2025), 55 pages. doi:10.1145/3703155
-
[15]
Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. 2024. Understanding the planning of LLM agents: A survey. arXiv:2402.02716 [cs.AI] https://arxiv.org/ abs/2402.02716
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. 2024. Position: LLMs can’t plan, but can help planning in LLM-modulo frameworks. InProceedings of the 41st International Conference on Machine Learning(Vienna, Austria)(ICML’24). JMLR.org, Article 921, 13 pages
work page 2024
-
[17]
Hannah Kim, Kushan Mitra, Chen Shen, Dan Zhang, and Estevam Hruschka
-
[18]
AIPOM: Agent-aware Interactive Planning for Multi-Agent Systems. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Ivan Habernal, Peter Schulam, and Jörg Tiede- mann (Eds.). Association for Computational Linguistics, Suzhou, China, 85–96. doi:10.18653/v1/2025.emnlp-demos.7
-
[19]
Joongwon Kim, Bhargavi Paranjape, Tushar Khot, and Hannaneh Hajishirzi
-
[20]
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning. arXiv:2406. [cs.CL]
-
[21]
Boyi Li, Philipp Wu, Pieter Abbeel, and Jitendra Malik. 2025. Interactive Task Planning with Language Models.Transactions on Machine Learning Research (2025). https://openreview.net/forum?id=VmfWywWuYQ
work page 2025
-
[22]
Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang. 2024. A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges.Vicinagearth1, 1 (2024), 9
work page 2024
-
[23]
Anthony Zhe Liu, Xinhe Wang, Jacob Sansom, Yao Fu, Jongwook Choi, Sun- gryull Sohn, Jaekyeom Kim, and Honglak Lee. 2025. Interactive and Expressive Code-Augmented Planning with Large Language Models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shu...
-
[24]
Shuodi Liu, Yingzhuo Liu, Zi Wang, Yusheng Wang, Huijia Wu, Liuyu Xiang, and Zhaofeng He. 2025. Select-Then-Decompose: From Empirical Analysis to Adaptive Selection Strategy for Task Decomposition in Large Language Models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Christos Christodoulopoulos, Tanmoy Chakrabo...
-
[25]
Damien Masson, Sylvain Malacria, Géry Casiez, and Daniel Vogel. 2024. Direct- GPT: A Direct Manipulation Interface to Interact with Large Language Models. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA)(CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 975, 16 pages. doi:10.1145/36...
- [26]
-
[27]
Hussein Mozannar, Gagan Bansal, Cheng Tan, Adam Fourney, Victor Dibia, Jingya Chen, Jack Gerrits, Tyler Payne, Matheus Kunzler Maldaner, Madeleine Grunde-McLaughlin, Eric Zhu, Griffin Bassman, Jacob Alber, Peter Chang, Ricky Loynd, Friederike Niedtner, Ece Kamar, Maya Murad, Rafah Hosn, and Saleema Amershi. 2025. Magentic-UI: Towards Human-in-the-loop Age...
- [28]
-
[29]
Yijia Shao, Vinay Samuel, Yucheng Jiang, John Yang, and Diyi Yang. 2026. Collab- orative Gym: A Framework for Enabling and Evaluating Human-Agent Collabo- ration. InThe Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=GDYueXtKXT
work page 2026
- [30]
-
[31]
Yashar Talebirad and Amirhossein Nadiri. 2023. Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents. arXiv:2306.03314 [cs.AI] https: //arxiv.org/abs/2306.03314
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D. Nguyen. 2025. Multi-Agent Collaboration Mechanisms: A Survey of LLMs. arXiv:2501.06322 [cs.AI] https://arxiv.org/abs/2501.06322
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kamb- hampati. 2023. On the planning abilities of large language models: a critical investigation. InProceedings of the 37th International Conference on Neural Infor- mation Processing Systems(New Orleans, LA, USA)(NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 3320, 13 pages
work page 2023
-
[34]
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. 2024. A survey on large language model based autonomous agents.Frontiers of Computer Science18, 6 (2024), 186345
work page 2024
- [35]
-
[36]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. InProceedings of the 36th International Conference on Neural Information Processing Systems(New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY...
work page 2022
-
[37]
White, Doug Burger, and Chi Wang
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang (Eric) Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Ahmed Awadallah, Ryen W. White, Doug Burger, and Chi Wang. 2024. AutoGen: Enabling Next- Gen LLM Applications via Multi-Agent Conversation. InCOLM 2024. https://www.microsoft.com/en-us/research/publication/autogen-enabling- next-gen-llm-a...
work page 2024
-
[38]
Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, Yangn- ing Li, Dongyuan Li, Renhe Jiang, Xue Liu, and Philip S Yu. 2025. A survey on large language model based human-agent systems.Authorea Preprints(2025). How to Steer Your Multi-Agent System: Human-LLM Collaborative P...
work page 2025
-
[39]
Break the problem into independent, atomic nodes
-
[40]
- You may include constants ONLY if they appear explicitly in the original problem statement
Each node is an INSTRUCTION only-describe what must be done, not the result. - You may include constants ONLY if they appear explicitly in the original problem statement. - Do not invent, look up, or leak unknown values into the plan; such values must be produced by earlier nodes or via [search]. - Do NOT mention any other nodes in the task description. -...
-
[41]
A single agent must be able to complete each node using ONLY: - the node's instruction, - the specified agent, and - outputs from its prereqs
-
[42]
Do NOT reference "the original question" inside nodes. Rewrite what's needed directly into each node's instruction
-
[43]
Use exactly one agent per node in the "agent_name" field. If multiple agents seem required, split the node
-
[44]
Use snake_case for output variable names
Include any necessary variable names directly in the instruction so the executing agent has everything it needs. Use snake_case for output variable names
-
[45]
- A single sink node (the node with the highest id) is the final output node
Produce a valid DAG: - No isolated nodes. - A single sink node (the node with the highest id) is the final output node
-
[46]
- Every edge must point from an existing output to a named input expected by the destination node
Edges: - Only create edges for actual data dependencies (where a later node's input name matches a prior node's output variable name). - Every edge must point from an existing output to a named input expected by the destination node. <given task> Prompt: Replanning <same system prompt as plan generation> A plan and user feedback are given to you. Your job...
work page 2026
-
[47]
id": -1, // Use negative integers (-1, -2, -3, ...) for all new nodes inside the replanned sub-graph
A selected sub-graph (a set of nodes and connecting edges) as the focus for replanning. Your goal is to regenerate ONLY the selected sub-graph nodes, while keeping the interface (inputs/outputs defined by edges connecting to outside nodes) fully consistent. ==================== GLOBAL INSTRUCTIONS ==================== - Every new node generated inside the...
-
[48]
**Boundary consistency:** - Any variable appearing on incoming edges from outside the sub- graph must appear as an input in at least one replanned node. - Any variable appearing on outgoing edges to outside the sub- graph must be produced as an output by at least one replanned node. - Outside node IDs and boundary edge structures must remain exactly the same
-
[49]
- Split tasks if multiple agent types would be required
**Atomic instructions:** - Each node must remain atomic, executable by exactly one agent. - Split tasks if multiple agent types would be required
-
[50]
**Self-contained tasks:** - Node instructions must not reference other nodes or "the original question." - Use variable names verbatim from inputs/outputs
-
[51]
nodes": [ <list of replanned node objects> ],
**Valid DAG:** - No isolated nodes. - Exactly one sink node inside the replanned sub-graph. ======================== RESPONSE FORMAT (JSON) ======================== { "nodes": [ <list of replanned node objects> ], "edges": [ <list of replanned edge objects> ] } A sub-graph plan and user feedback are given to you. You job is to revise the subplan based on ...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.