Self-Optimizing Multi-Agent Systems for Deep Research
Pith reviewed 2026-05-13 18:10 UTC · model grok-4.3
The pith
Multi-agent systems self-optimize prompts through self-play to match or exceed expert performance in deep research.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By enabling agents in a multi-agent architecture to self-play and explore different prompt combinations, the system can generate high-quality Deep Research outputs that match or outperform those from expert-crafted prompts, addressing the limitations of static, hand-engineered designs.
What carries the argument
Self-play optimization of prompt combinations, where an orchestrator agent coordinates worker agents that test and refine prompts autonomously.
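A minimal sketch of how such a search over prompt combinations could work, with a stubbed scoring function standing in for real agent rollouts. Every name below (`PROMPT_SPACE`, `score`, `self_play_search`) is illustrative, not from the paper:

```python
import itertools
import random

# Candidate instruction variants per agent role. In a real system these would
# be generated and mutated by the agents themselves during self-play.
PROMPT_SPACE = {
    "orchestrator": ["plan tasks step by step", "decompose the query into subtasks"],
    "worker": ["answer concisely with citations", "extract all relevant evidence"],
}

def score(combo):
    # Stub for an expensive rollout: run the full pipeline with this prompt
    # combination on a batch of queries and return an average quality score.
    rng = random.Random(str(sorted(combo.items())))
    return rng.random()

def self_play_search(space):
    # Enumerate every prompt combination and keep the best-scoring one.
    # Real systems would sample and refine rather than enumerate exhaustively.
    best_combo, best_score = None, float("-inf")
    for values in itertools.product(*space.values()):
        combo = dict(zip(space.keys(), values))
        s = score(combo)
        if s > best_score:
            best_combo, best_score = combo, s
    return best_combo, best_score

combo, quality = self_play_search(PROMPT_SPACE)
```

The exhaustive loop is only tractable for tiny spaces; the point is that "exploring prompt combinations" reduces to search over a product space with an expensive, noisy objective.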
If this is right
- Reduces the need for time-consuming hand-engineering of prompts by experts.
- Creates systems that can potentially adapt to new complex information needs more readily.
- Lowers the overall cost and effort required to build effective deep research tools.
- May lead to more robust performance across diverse document collections and queries.
Where Pith is reading between the lines
- Similar self-optimization techniques could apply to other agent-based tasks like multi-step planning or collaborative problem-solving.
- Over time, such systems might develop emergent behaviors not anticipated in initial designs.
- Combining this with larger language models could further enhance synthesis capabilities in research scenarios.
Load-bearing premise
That the performance improvements observed from self-play on specific tested tasks will hold for entirely new queries and document sets without the system overfitting to its training environment.
What would settle it
Running the self-optimized system on a fresh set of complex user queries with new document collections: if its outputs consistently match or exceed the quality of those from expert-designed prompts, the generalization claim holds; consistent failure to do so would refute it.
Original abstract
Given a user's complex information need, a multi-agent Deep Research system iteratively plans, retrieves, and synthesizes evidence across hundreds of documents to produce a high-quality answer. In one possible architecture, an orchestrator agent coordinates the process, while parallel worker agents execute tasks. Current Deep Research systems, however, often rely on hand-engineered prompts and static architectures, making improvement brittle, expensive, and time-consuming. We therefore explore various multi-agent optimization methods to show that enabling agents to self-play and explore different prompt combinations can produce high-quality Deep Research systems that match or outperform expert-crafted prompts.
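The plan/retrieve/synthesize loop the abstract describes can be sketched as follows; the keyword-overlap retrieval and the function names are toy stand-ins, not the paper's implementation:

```python
def plan(query):
    # Orchestrator: split the information need into research tasks.
    return [f"background on {query}", f"recent findings on {query}"]

def retrieve(task, corpus):
    # Worker + reader: naive keyword overlap standing in for a search engine.
    words = set(task.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]

def synthesize(query, evidence):
    # Writer: combine the gathered evidence into a final report.
    return f"Answer to '{query}' based on {len(evidence)} passages."

def deep_research(query, corpus):
    evidence = []
    for task in plan(query):                     # orchestrator plans tasks
        evidence.extend(retrieve(task, corpus))  # workers retrieve per task
    return synthesize(query, evidence)           # writer synthesizes

corpus = ["findings on agents", "unrelated cooking notes", "background on agents"]
report = deep_research("agents", corpus)
```

In the real system each stage is an LLM call and the worker retrievals run in parallel; the control flow, though, is exactly this loop.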
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes multi-agent Deep Research systems that iteratively plan, retrieve, and synthesize evidence from large document collections to answer complex user queries. It explores self-optimization methods in which agents engage in self-play to discover effective prompt combinations, claiming that the resulting systems can match or outperform those built with expert-crafted prompts and static architectures.
Significance. If the empirical claims hold under rigorous testing, the work could reduce the cost and brittleness of prompt engineering for multi-agent retrieval and synthesis pipelines, offering a path toward more adaptive Deep Research systems. The approach aligns with growing interest in automated agent design within information retrieval.
major comments (2)
- [Abstract] Abstract and evaluation sections: the central claim that self-play optimization yields transferable performance gains rests on unverified generalization. No held-out query sets, cross-collection tests, or overfitting controls are described, so reported improvements versus expert prompts could be artifacts of the specific optimization environment rather than robust advances.
- [Methods] Methods and results: concrete details on the self-play procedure, prompt search space, optimization algorithm, datasets, metrics (e.g., answer quality, retrieval precision), baselines, and ablation studies are absent. Without these, the performance claim cannot be evaluated or reproduced.
minor comments (2)
- [Architecture] Clarify the distinction between the orchestrator and worker agents and how self-play coordinates prompt updates across them.
- [Discussion] Add explicit discussion of computational cost and scalability of the self-play process relative to hand-engineering.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address each major comment point by point below and will revise the manuscript accordingly to improve clarity, rigor, and reproducibility.
Point-by-point responses
-
Referee: [Abstract] Abstract and evaluation sections: the central claim that self-play optimization yields transferable performance gains rests on unverified generalization. No held-out query sets, cross-collection tests, or overfitting controls are described, so reported improvements versus expert prompts could be artifacts of the specific optimization environment rather than robust advances.
Authors: We agree that the current manuscript does not sufficiently demonstrate generalization. In the revised version we will add held-out query evaluations, cross-collection experiments on additional document sets, and explicit overfitting controls (e.g., monitoring performance on a validation split during self-play). These additions will allow us to test whether the observed gains transfer beyond the optimization environment. revision: yes
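One way the promised overfitting control could look in code: a greedy search that accepts candidates on training score but tracks a held-out validation score and stops once it stagnates. This is a generic sketch, not the authors' procedure:

```python
def optimize_with_validation(candidates, train_score, val_score, patience=2):
    # Greedy prompt search with early stopping on a held-out validation split.
    # A candidate is considered only if it improves the training score; the
    # search halts once the validation score has failed to improve `patience`
    # times in a row -- a simple sign of overfitting to the training queries.
    best, best_val, stale = None, float("-inf"), 0
    for cand in candidates:
        if best is not None and train_score(cand) <= train_score(best):
            continue
        v = val_score(cand)
        if v > best_val:
            best, best_val, stale = cand, v, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, best_val

# Toy scores: training keeps improving while validation peaks at "b",
# so the search should return "b" rather than the train-best "d".
train = {"a": 0.5, "b": 0.7, "c": 0.9, "d": 0.95}
val = {"a": 0.5, "b": 0.6, "c": 0.4, "d": 0.3}
best, best_val = optimize_with_validation(list(train), train.get, val.get)
```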
-
Referee: [Methods] Methods and results: concrete details on the self-play procedure, prompt search space, optimization algorithm, datasets, metrics (e.g., answer quality, retrieval precision), baselines, and ablation studies are absent. Without these, the performance claim cannot be evaluated or reproduced.
Authors: We acknowledge that the submitted draft omits necessary implementation details. The revised manuscript will contain an expanded Methods section that fully specifies the self-play procedure, the prompt search space, the optimization algorithm, the datasets, the evaluation metrics for answer quality and retrieval precision, all baselines, and the ablation studies performed. We will also include pseudocode to support reproducibility. revision: yes
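As a concrete example of one listed metric, retrieval precision can be defined as the fraction of retrieved documents that are relevant; a generic definition, not necessarily the paper's exact formulation:

```python
def retrieval_precision(retrieved, relevant):
    # Fraction of retrieved documents that are actually relevant.
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

precision = retrieval_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```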
Reference graph
Works this paper leans on
-
[1]
Agrawal, L.A., Tan, S., Soylu, D., Ziems, N., Khare, R., Opsahl-Ong, K., Singhvi, A., Shandilya, H., Ryan, M.J., Jiang, M., Potts, C., Sen, K., Dimakis, A.G., Stoica, I., Klein, D., Zaharia, M., Khattab, O.: GEPA: Reflective prompt evolution can outperform reinforcement learning (2025), https://arxiv.org/abs/2507.19457
-
[2]
Asai, A., He, J., Shao, R., Shi, W., Singh, A., Chang, J.C., Lo, K., Soldaini, L., Feldman, S., D'Arcy, M., Wadden, D., Latzke, M., Tian, M., Ji, P., Liu, S., Tong, H., Wu, B., Xiong, Y., Zettlemoyer, L., Neubig, G., Weld, D., Downey, D., Yih, W.t., Koh, P.W., Hajishirzi, H.: OpenScholar: Synthesizing scientific literature with retrieval-augmented LMs (...
- [3]
- [4]
- [5]
-
[6]
Hu, S., Lu, C., Clune, J.: Automated design of agentic systems (2025), https://arxiv.org/abs/2408.08435
-
[7]
Huang, Y., Chen, Y., Zhang, H., Li, K., Zhou, H., Fang, M., Yang, L., Li, X., Shang, L., Xu, S., Hao, J., Shao, K., Wang, J.: Deep research agents: A systematic examination and roadmap (2025), https://arxiv.org/abs/2506.18096
-
[8]
Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T.T., Moazam, H., Miller, H., Zaharia, M., Potts, C.: DSPy: Compiling declarative language model calls into self-improving pipelines. In: The Twelfth International Conference on Learning Representations (2024)
-
[9]
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)
- [10]
-
[11]
Rozanov, N., Rei, M.: StateAct: Enhancing LLM base agents via self-prompting and state-tracking. In: Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025) (2025), https://aclanthology.org/2025.realm-1.27
-
[12]
Shao, R., Asai, A., Shen, S.Z., Ivison, H., Kishore, V., Zhuo, J., Zhao, X., Park, M., Finlayson, S.G., Sontag, D., Murray, T., Min, S., Dasigi, P., Soldaini, L., Brahman, F., Yih, W.t., Wu, T., Zettlemoyer, L., Kim, Y., Hajishirzi, H., Koh, P.W.: DR Tulu: Reinforcement learning with evolving rubrics for deep research (2025), https://arxiv.org/abs/2511.19399
- [13]
-
[14]
Sharma, M., Zhang, C.B.C., Bandi, C., Wang, C., Aich, A., Nghiem, H., Rabbani, T., Htet, Y., Jang, B., Basu, S., Balwani, A., Peskoff, D., Ayestaran, M., Hendryx, S.M., Kenstler, B., Liu, B.: ResearchRubrics: A benchmark of prompts and rubrics for evaluating deep research agents (2025), https://arxiv.org/abs/2511.07685
-
[15]
Shi, Z., Chen, Y., Li, H., Sun, W., Ni, S., Lyu, Y., Fan, R.Z., Jin, B., Weng, Y., Zhu, M., Xie, Q., Guo, X., Yang, Q., Wu, J., Zhao, J., Tang, X., Ma, X., Wang, C., Mao, J., Ai, Q., Huang, J.T., Wang, W., Zhang, Y., Yang, Y., Tu, Z., Ren, Z.: Deep research: A systematic survey (2025), https://arxiv.org/abs/2512.02038
- [16]
- [17]
-
[18]
Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q.V., Zhou, D., Chen, X.: Large language models as optimizers (2024), https://arxiv.org/abs/2309.03409
-
[19]
Yuksekgonul, M., Bianchi, F., Boen, J., Liu, S., Huang, Z., Guestrin, C., Zou, J.: TextGrad: Automatic "differentiation" via text (2024), https://arxiv.org/abs/2406.07496
- [20]
-
[21]
Zhang, W., Tang, K., Wu, H., Wang, M., Shen, Y., Hou, G., Tan, Z., Li, P., Zhuang, Y., Lu, W.: Agent-Pro: Learning to evolve via policy-level reflection and optimization. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) (2024), https://aclanthology.org/2024.acl-long.292
-
[22]
Zhou, H., Wan, X., Sun, R., Palangi, H., Iqbal, S., Vulić, I., Korhonen, A., Arık, S.Ö.: Multi-agent design: Optimizing agents with better prompts and topologies (2025), https://arxiv.org/abs/2502.02533
A Minimal prompts (appendix fragments from the paper, by Arthur Câmara, Vincent Slot, and Jakub Zavrel)
Minimal orchestrator prompt: Given a user query, create a report that answer the user' ...
-
[23]
The `orchestrator` receives a user's question and devises a plan with a list of research tasks that need to be completed before writing the final report
-
[24]
Each task's query is submitted to a search engine, and the relevant information from each results page is extracted by the `reader`. The information of all search results pages is then combined by the `aggregator`.
-
[25]
The `orchestrator` reads the merged information for all submitted tasks and decides to either run another round of tasks or call the `writer`
-
[26]
The `writer` receives all the information from the tasks and writes a final report. I will provide you with a list of examples of different task inputs provided to a single agent, together with some feedback on the quality of the output generated by the agent using its current instructions. Read the inputs carefully and...
-
[27]
Make sure your new instructions are generalizable to any computer science related task, and not specific to any particular task present in the examples
-
[28]
Do not include any other information or comments in your response
-
[29]
Do not suggest or imply any formatting to the output of the agent, like requiring the output to be a JSON or have specific fields, unless this is already present in the current instructions. In this round, you are optimizing the prompt of the `{{agent_name}}` agent.
C Exploration trees: Figure 2 shows an exampl...
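The `{{agent_name}}` placeholder suggests the optimizer meta-prompt is rendered per agent on each round. A sketch of that substitution using Python string formatting; the abridged template text and function name are illustrative, not the paper's exact prompt or templating engine:

```python
# Abridged meta-prompt; `{agent_name}` and `{examples}` are filled per round.
OPTIMIZER_TEMPLATE = (
    "In this round, you are optimizing the prompt of the `{agent_name}` agent.\n"
    "Here are example task inputs and feedback on the agent's outputs:\n"
    "{examples}\n"
    "Write new, generalizable instructions. "
    "Do not include any other information or comments in your response."
)

def build_optimizer_prompt(agent_name, examples):
    # Render the meta-prompt for one optimization round of one agent.
    rendered = "\n".join(f"- input: {i!r} feedback: {f!r}" for i, f in examples)
    return OPTIMIZER_TEMPLATE.format(agent_name=agent_name, examples=rendered)

prompt = build_optimizer_prompt(
    "reader",
    [("extract facts from page 1", "missed two key numbers")],
)
```

The same template serves every agent role, which is what lets one optimizer loop over the `orchestrator`, `reader`, `aggregator`, and `writer` in turn.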