pith. machine review for the scientific record.

arXiv: 2604.18612 · v1 · submitted 2026-04-14 · 💻 cs.NE · cs.AI · cs.LG

Recognition: unknown

Agent-GWO: Collaborative Agents for Dynamic Prompt Optimization in Large Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 14:21 UTC · model grok-4.3

classification 💻 cs.NE · cs.AI · cs.LG
keywords prompt optimization · grey wolf optimizer · multi-agent systems · large language models · reasoning benchmarks · decoding hyperparameters · dynamic optimization

The pith

By modeling prompts and decoding settings as inheritable agent configurations and using grey wolf leader-follower updates, Agent-GWO produces more accurate and stable LLM reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that a multi-agent system drawing on the grey wolf optimizer can jointly tune prompt templates and decoding hyperparameters to improve performance on complex reasoning tasks. Existing single-agent or manual approaches often produce unstable results that vary with task distribution and model choice. The method selects three leader agents to direct collaborative updates across the group, driving iterative refinement toward usable optimal configurations. A sympathetic reader would care because this could cut down on hand-crafted prompt engineering while making LLM outputs more reliable across different backbones. Experiments on mathematical and hybrid benchmarks support gains in both accuracy and stability.

Core claim

Agent-GWO unifies prompt templates and decoding hyperparameters as inheritable agent configurations. It applies the leader-follower mechanism of the Grey Wolf Optimizer by automatically selecting three leader agents to guide the collaborative updates of the remaining agents. This process enables iterative convergence toward robust optimal reasoning configurations that integrate directly into LLM inference.

What carries the argument

The leader-follower mechanism of the Grey Wolf Optimizer applied to agent configurations that represent both prompt templates and decoding hyperparameters.
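The update rule this refers to is the standard GWO position update. A minimal numerical sketch, in plain NumPy, with continuous vectors standing in for agent configurations (the paper's actual prompt encoding is not specified in the abstract):

```python
import numpy as np

def gwo(objective, dim=5, n_agents=8, iters=60, seed=0):
    """Canonical Grey Wolf Optimizer: the three fittest agents (alpha, beta,
    delta) each pull every agent toward themselves, and the control
    parameter a(t) decays from 2 to 0, shifting the search from
    exploration to exploitation."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5, 5, (n_agents, dim))      # agent "configurations"
    best, best_val = X[0].copy(), float("inf")
    for t in range(iters):
        fitness = np.array([objective(x) for x in X])
        if fitness.min() < best_val:             # track the global best
            best_val = float(fitness.min())
            best = X[fitness.argmin()].copy()
        leaders = X[np.argsort(fitness)[:3]]     # alpha, beta, delta (a copy)
        a = 2.0 * (1 - t / iters)                # decreasing parameter a(t)
        for i in range(n_agents):
            pulls = []
            for lead in leaders:
                A = 2 * a * rng.random(dim) - a  # exploration/exploitation knob
                C = 2 * rng.random(dim)
                pulls.append(lead - A * np.abs(C * lead - X[i]))
            X[i] = np.mean(pulls, axis=0)        # average of the three pulls
    return best, best_val

# toy check: minimize the 5-D sphere function
best, val = gwo(lambda x: float(np.sum(x ** 2)))
```

The three-leader average is what distinguishes GWO from single-leader swarm methods; by the abstract's account, Agent-GWO inherits this structure wholesale.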

If this is right

  • Prompts and decoding hyperparameters are optimized together inside one framework instead of separately.
  • Performance becomes more stable across varying task distributions and LLM backbones.
  • Reliance on manual static prompt design for complex reasoning decreases.
  • Optimized configurations transfer more readily between different models and tasks.
  • The final setups integrate directly into standard LLM inference without extra adjustments.
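The last bullet amounts to carrying the winning configuration into an ordinary generation call. A sketch under assumed field names (the paper's configuration schema is not given in the abstract; `AgentConfig` and `render` are illustrative):

```python
from dataclasses import dataclass

@dataclass
class AgentConfig:
    """Hypothetical inheritable configuration: the prompt template plus the
    decoding hyperparameters that Agent-GWO is said to optimize jointly."""
    template: str
    temperature: float
    top_p: float

def render(cfg: AgentConfig, question: str) -> str:
    # At inference time the optimized template is plain string formatting;
    # temperature and top_p would be forwarded to the model's decode call.
    return cfg.template.format(question=question)

cfg = AgentConfig("Think step by step.\nQ: {question}\nA:", temperature=0.7, top_p=0.9)
prompt = render(cfg, "What is 3 + 7?")
```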

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same agent-configuration approach could be tested on optimization of other LLM controls such as temperature schedules or output length constraints.
  • Evolutionary multi-agent methods of this type might extend to prompt tuning for non-reasoning tasks like code generation or summarization.
  • Further experiments on much larger or more diverse sets of LLMs would show how sensitive the reported gains are to the specific backbones used.

Load-bearing premise

That treating prompts and hyperparameters as inheritable agent configurations and applying the grey wolf leader-follower mechanism will produce optima that transfer robustly rather than overfitting to the tested benchmarks or models.

What would settle it

Applying the Agent-GWO-optimized configurations to a new, unseen reasoning benchmark or different LLM backbone and checking whether accuracy and stability improvements appear compared with baseline prompt methods.

Figures

Figures reproduced from arXiv: 2604.18612 by Chaoning Zhang, Chenghao Li, Fachrina Dewi Puspitasari, Guoqing Wang, Heng Tao Shen, Jiaquan Zhang, Jiwei Wei, Malu Zhang, Qigan Sun, Shuxu Chen, Tae-Ho Kim, Xudong Wang, Yang Yang.

Figure 1: GWO abstracts each LLM agent as a “wolf” …
Figure 2: The overall Agent-GWO framework operates by having each agent process a dataset, Prompt, and …
Figure 3: Accuracy over varying number of agents on …
Figure 5: Illustration of GWO algorithm. … achieves a balance between global search and local exploitation. As iterations progress, the decreasing parameter a(t) ensures a smooth transition from exploration to exploitation, enhancing convergence toward the global optimum.
Original abstract

Large Language Models (LLMs) have demonstrated strong capabilities in complex reasoning tasks, while recent prompting strategies such as Chain-of-Thought (CoT) have further elevated their performance in handling complex logical problems. Despite these advances, high-quality reasoning remains heavily reliant on manual static prompts and is sensitive to decoding configurations and task distributions, leading to performance fluctuations and limited transferability. Existing automatic prompt optimization methods typically adopt single-agent local search, failing to simultaneously optimize prompts and decoding hyperparameters within a unified framework to achieve stable global improvements. To address this limitation, we propose Agent-GWO, a dynamic prompt optimization framework for complex reasoning. Specifically, we unify prompt templates and decoding hyperparameters as inheritable agent configurations. By leveraging the leader-follower mechanism of the Grey Wolf Optimizer (GWO), we automatically select three leader agents ($\alpha$, $\beta$, and $\delta$) to guide the collaborative updates of the remaining agents, enabling iterative convergence toward robust optimal reasoning configurations that can be seamlessly integrated for inference. Extensive experiments on multiple mathematical and hybrid reasoning benchmarks across diverse LLM backbones show that Agent-GWO consistently improves accuracy and stability over existing prompt optimization methods. The code will be released publicly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 0 minor

Summary. The paper proposes Agent-GWO, a dynamic prompt optimization framework for LLMs that unifies prompt templates and decoding hyperparameters as inheritable agent configurations. It employs the leader-follower mechanism of the Grey Wolf Optimizer (GWO) with alpha, beta, and delta leaders to guide collaborative updates of agents towards optimal configurations for complex reasoning tasks. The authors report that extensive experiments on mathematical and hybrid reasoning benchmarks across diverse LLM backbones demonstrate consistent improvements in accuracy and stability over existing prompt optimization methods.

Significance. If the central claims hold, this work could advance the field of automated prompt engineering by demonstrating the effectiveness of evolutionary optimization techniques, specifically GWO's hierarchical collaboration, in jointly optimizing discrete prompts and continuous hyperparameters. This might lead to more robust and transferable reasoning configurations compared to single-agent approaches. The planned public release of code is a positive step for reproducibility.

major comments (4)
  1. Abstract: The abstract claims 'consistent improvements in accuracy and stability' but provides no quantitative results, such as specific accuracy gains, number of runs, error bars, or statistical tests. This omission makes it impossible to assess the practical significance of the reported gains without the full experimental section.
  2. Method section: The encoding of discrete prompt templates into the continuous search space for GWO updates is not described. GWO relies on vector-based position updates; without specifying how text prompts are represented (e.g., via embeddings, tokenization, or mutation operators), it is unclear whether the leader-follower mechanism performs meaningful optimization or effectively reduces to heuristic search.
  3. Method section: The fitness function used to evaluate agent configurations and whether the optimization is performed jointly across benchmarks or independently per task/benchmark is not specified. This detail is load-bearing for the claim of robust, transferable optima rather than benchmark-specific overfitting.
  4. Experiments section: The results section should include details on the number of independent optimization runs, variance across runs, and formal statistical comparisons (e.g., t-tests or Wilcoxon tests) against baselines to substantiate the 'consistent' and 'stable' improvements.
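One plausible answer to comment 2, offered purely as an illustration and not as the paper's method, is to let GWO move a continuous weight vector over a discrete pool of prompt fragments:

```python
import numpy as np

# Hypothetical encoding: a continuous weight vector over a fixed pool of
# instruction fragments. GWO moves the weights in R^n; decoding keeps the
# top-k fragments, so vector-space updates act on discrete text prompts.
FRAGMENTS = [
    "Think step by step.",
    "Extract all numerical quantities first.",
    "Double-check the final answer.",
    "Show intermediate calculations.",
]

def decode_prompt(weights: np.ndarray, k: int = 2) -> str:
    """Map a continuous position to a discrete prompt: the k largest
    weights select fragments, emitted in pool order."""
    top = sorted(np.argsort(weights)[-k:])
    return " ".join(FRAGMENTS[i] for i in top)

prompt = decode_prompt(np.array([0.1, 0.9, 0.4, 0.8]))
# prompt == "Extract all numerical quantities first. Show intermediate calculations."
```

Whether the paper does anything like this is exactly what the referee is asking; without such a mapping, "position update" has no defined meaning for text.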

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

Point-by-point responses
  1. Referee: Abstract: The abstract claims 'consistent improvements in accuracy and stability' but provides no quantitative results, such as specific accuracy gains, number of runs, error bars, or statistical tests. This omission makes it impossible to assess the practical significance of the reported gains without the full experimental section.

    Authors: We agree that quantitative highlights would strengthen the abstract. In the revision we will add concise references to key accuracy gains and stability metrics from the experiments section (while respecting length limits) and explicitly note the use of multiple runs and statistical validation reported later in the paper. revision: yes

  2. Referee: Method section: The encoding of discrete prompt templates into the continuous search space for GWO updates is not described. GWO relies on vector-based position updates; without specifying how text prompts are represented (e.g., via embeddings, tokenization, or mutation operators), it is unclear whether the leader-follower mechanism performs meaningful optimization or effectively reduces to heuristic search.

    Authors: We acknowledge the description was insufficient. The revised method section will provide a complete account of how discrete prompt templates are mapped into the continuous search space, including the representation chosen and the operators used to enable meaningful leader-follower updates under GWO. revision: yes

  3. Referee: Method section: The fitness function used to evaluate agent configurations and whether the optimization is performed jointly across benchmarks or independently per task/benchmark is not specified. This detail is load-bearing for the claim of robust, transferable optima rather than benchmark-specific overfitting.

    Authors: We will explicitly define the fitness function (task accuracy on held-out validation examples) and clarify that optimization runs are performed independently per benchmark, with subsequent cross-benchmark evaluation used to demonstrate transferability of the resulting configurations. revision: yes

  4. Referee: Experiments section: The results section should include details on the number of independent optimization runs, variance across runs, and formal statistical comparisons (e.g., t-tests or Wilcoxon tests) against baselines to substantiate the 'consistent' and 'stable' improvements.

    Authors: We agree these details are necessary. The revised experiments section will report the number of independent runs, include variance measures, and present formal statistical comparisons (paired t-tests or Wilcoxon tests) against all baselines to support the claims of consistent and stable gains. revision: yes
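The comparison promised in response 4 is standard machinery. A self-contained sketch using an exact sign test as a stand-in for the Wilcoxon signed-rank test (the accuracy numbers are invented for illustration, not taken from the paper):

```python
from math import comb

def sign_test_p(a, b):
    """One-sided exact sign test: probability of seeing this many wins for
    method a over paired runs by chance alone, ignoring ties. A crude but
    dependency-free stand-in for the Wilcoxon signed-rank test."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    wins = sum(d > 0 for d in diffs)
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Invented per-run accuracies for 10 independent optimization runs.
agent_gwo = [0.82, 0.84, 0.81, 0.85, 0.83, 0.84, 0.82, 0.86, 0.83, 0.84]
baseline  = [0.78, 0.80, 0.79, 0.81, 0.77, 0.80, 0.79, 0.82, 0.78, 0.80]
p = sign_test_p(agent_gwo, baseline)  # 10 wins of 10 gives p = 2**-10
```

With real data the Wilcoxon test (e.g., `scipy.stats.wilcoxon`) would also exploit the magnitudes of the paired differences, which is why the referee asks for it by name.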

Circularity Check

0 steps flagged

No significant circularity; the framework applies the external GWO to unified configurations, with independent experimental validation.

full rationale

The paper introduces Agent-GWO by treating prompt templates and decoding hyperparameters as inheritable agent configurations and applying the pre-existing Grey Wolf Optimizer leader-follower hierarchy (alpha-beta-delta) for collaborative updates. No equations, fitted parameters, or self-referential definitions appear that would make the reported accuracy/stability gains tautological or reduce them to the inputs by construction. The central claim is supported by experiments on multiple benchmarks and LLM backbones rather than by any derivation chain that loops back to its own assumptions or prior self-citations. The method is presented as an engineering application of a known optimizer, not a self-derived result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities can be identified from the abstract alone; a full-text audit would be required to list any fitted scales, convergence assumptions, or new agent constructs.

pith-pipeline@v0.9.0 · 5560 in / 1151 out tokens · 22879 ms · 2026-05-10T14:21:26.397141+00:00 · methodology

discussion (0)



    On Tuesday, she found 7 more bags of cans. Now, we can add the number of bags of cans she had on Monday to the number she found on Tuesday: Total bags of cans=Bags on Monday +Bags found on Tuesday. Substituting the values: Total bags of cans= 3 + 7 Now, we perform the addition: Total bags of cans= 10 Thus, Tiffany had a total of 10 bags of cans altogether...