pith. machine review for the scientific record. sign in

arxiv: 2605.08686 · v1 · submitted 2026-05-09 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-agent systemslarge language modelsiterative routingcritique-and-routingMarkov Decision Processpolicy gradientsheterogeneous agentsreasoning benchmarks
0
0 comments X

The pith

The critique-and-routing controller models multi-agent LLM coordination as a finite-horizon MDP and optimizes it via policy gradients to enable iterative refinement that outperforms one-shot routing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that existing one-shot routing in multi-agent LLM systems can be improved by allowing a controller to critique intermediate drafts at each step, decide whether to stop, and route to the next suitable agent if refinement is needed. This sequential approach matters because it directly addresses the lack of mechanisms for iterative improvement and the tendency to overuse strong but costly models in heterogeneous pools. By casting the decisions as a finite-horizon Markov Decision Process with agent-utilization constraints and a composite reward, then optimizing the policy under Lagrangian relaxation, the controller learns cost-aware stopping and routing rules. Experiments across multiple systems and seven benchmarks support that the resulting policy beats baselines while keeping the strongest agent below 25 percent of calls.

Core claim

The central claim is that formulating the controller's per-turn choices—evaluate the current draft, stop or continue, and select the next agent—as a finite-horizon MDP with explicit utilization constraints, a composite reward across turns, and policy-gradient optimization under a Lagrangian-relaxed objective yields a policy that consistently outperforms one-shot baselines, narrows the gap to the strongest agent, and achieves this with under 25 percent usage of that agent across heterogeneous multi-agent setups and reasoning benchmarks.

What carries the argument

The critique-and-routing controller formulated as a finite-horizon Markov Decision Process with agent-utilization constraints and a composite reward, optimized by policy gradients under Lagrangian relaxation.

If this is right

  • The controller outperforms state-of-the-art one-shot routing baselines across seven reasoning benchmarks.
  • It substantially narrows the performance gap to the strongest agent in the pool.
  • The strongest agent is invoked for fewer than 25 percent of total calls.
  • The approach works across multiple heterogeneous multi-agent systems.
  • Sequential decisions support iterative refinement of intermediate drafts rather than single-shot outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The sequential MDP framing could be extended to other orchestration tasks where intermediate quality checks matter, such as code generation or multi-step planning.
  • Explicit constraints on agent usage during training may prove more effective than post-hoc filtering for controlling inference costs in production deployments.
  • The composite reward structure points to a general way to encode quality-cost tradeoffs that other LLM routing methods could adopt.
  • Testing generalization when the agent pool changes dynamically would clarify how robust the learned policy remains outside fixed training conditions.

Load-bearing premise

The MDP formulation, composite reward, and Lagrangian-relaxed policy-gradient optimization produce decisions that generalize across benchmarks and agent pools without bias introduced by the reward design or constraint handling.

What would settle it

A new experiment on an unseen reasoning benchmark or different LLM pool in which the controller shows no outperformance over one-shot baselines or requires the strongest agent in more than 25 percent of calls would falsify the performance claims.

Figures

Figures reproduced from arXiv: 2605.08686 by Christopher G. Brinton, Dong-Jun Han, Guangchen Lan, Liangqi Yuan, Wenzhi Fang.

Figure 1
Figure 1. Figure 1: Controller interacts with a heterogeneous pool of agents over mul [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Multi-turn rollout tree for GRPO training of the critique-and-routing controller. From [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Training dynamics for two multi-agent systems with different controller and agent-pool configurations. System 1 uses Qwen2.5-7B-Base as the controller with agents Qwen3-30B-A3B￾Instruct (Agent 1), Qwen2.5-7B-Instruct (Agent 2), and Qwen2.5-1.5B-Instruct. System 2 uses Qwen3-4B-Base as the controller with agents Qwen3-30B-A3B-Instruct (Agent 1), Ministral-3-8B￾Instruct (Agent 2), and Llama-3-2-1B-Instruct. … view at source ↗
Figure 4
Figure 4. Figure 4: Accuracy under differ￾ent maximum turn budgets 1 2 3 Maximum Number of Turns 0.00 0.25 0.50 0.75 1.00 Fraction Exhausted System 1 System 2 [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Sensitivity of our method to different Lagrangian multipliers. We vary the multiplier [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
read the original abstract

Multi-agent large language model (LLM) systems often rely on a controller to coordinate a pool of heterogeneous models, yet existing controllers are typically limited to one-shot routing: they select a model once and return its output directly. Such routing-only designs provide no mechanism to critique intermediate drafts or support iterative refinement. To address this limitation, we propose a critique-and-routing controller that casts multi-agent coordination as a sequential decision problem. At each turn, the controller evaluates the current draft, decides whether to stop or continue, and, if needed, selects the next agent for further refinement. We formulate this process as a finite-horizon Markov Decision Process (MDP) with explicit agent-utilization constraints, design a composite reward for controller decisions across turns, and optimize the controller via policy gradients under a Lagrangian-relaxed objective. Extensive experiments across multiple heterogeneous multi-agent systems and seven reasoning benchmarks show that our method consistently outperforms state-of-the-art baselines and substantially narrows the gap to the strongest agent, while using it for fewer than 25% of total calls.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes an iterative critique-and-routing controller for multi-agent systems with heterogeneous LLMs. It formulates coordination as a finite-horizon MDP with explicit agent-utilization constraints, designs a composite reward function, and optimizes the controller via policy gradients under Lagrangian relaxation. Experiments across multiple heterogeneous multi-agent setups and seven reasoning benchmarks show consistent outperformance over state-of-the-art baselines, narrowing the gap to the strongest agent while invoking it for fewer than 25% of total calls.

Significance. If the empirical results hold, the work offers a practical mechanism for iterative refinement and efficient routing in LLM multi-agent systems, reducing reliance on the strongest (often costliest) models. A notable strength is the prompt-based implementation with zero additional parameters, where the LLM approximates the policy and composite rewards/constraints are enforced at inference time rather than via gradient updates; this sidesteps training instability. The inclusion of reward-component ablations and cross-benchmark generalization checks further supports the claims.

major comments (1)
  1. [Abstract and Methods] Abstract and Methods: The abstract claims the controller is optimized 'via policy gradients under a Lagrangian-relaxed objective,' yet the methods describe a zero-parameter prompt-based controller that approximates the policy directly with the LLM and enforces the composite reward plus constraints at inference time without any gradient-based parameter updates. This mismatch is load-bearing for the central methodological claim and requires explicit reconciliation (e.g., clarifying that the MDP is a conceptual model rather than a trained policy).
minor comments (2)
  1. [§4] §4 (Experiments): The description of baseline implementations could be expanded to include exact prompt templates and hyperparameter settings used for fair comparison.
  2. [Table 1 and Figure 3] Table 1 and Figure 3: Axis labels and legend entries should explicitly state the metric (e.g., accuracy vs. cost) and the exact fraction of calls to the strongest agent to improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The feedback highlights an important point of potential confusion in how the methodological contribution is presented. We address the major comment below and will revise the manuscript to resolve the inconsistency.

read point-by-point responses
  1. Referee: [Abstract and Methods] Abstract and Methods: The abstract claims the controller is optimized 'via policy gradients under a Lagrangian-relaxed objective,' yet the methods describe a zero-parameter prompt-based controller that approximates the policy directly with the LLM and enforces the composite reward plus constraints at inference time without any gradient-based parameter updates. This mismatch is load-bearing for the central methodological claim and requires explicit reconciliation (e.g., clarifying that the MDP is a conceptual model rather than a trained policy).

    Authors: We agree that the current wording in the abstract creates a misleading impression. The finite-horizon MDP formulation is used strictly as a conceptual model to derive the structure of the composite reward function and the explicit agent-utilization constraints. In the actual implementation, the controller is realized as a zero-parameter, prompt-based system: the LLM directly approximates the policy by generating decisions at inference time, while the reward components and constraints are enforced through carefully designed prompts rather than through any gradient-based optimization or Lagrangian relaxation during training. No policy-gradient updates or parameter optimization occur. We will revise the abstract (and the opening of the Methods section) to explicitly state that the MDP serves as a design framework rather than a trained policy, and that the controller is implemented via prompting without gradient-based training. This change will be incorporated in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper formulates multi-agent coordination as a finite-horizon MDP with agent-utilization constraints and a composite reward, then implements the controller as a prompt-based LLM policy (zero trainable parameters) that enforces the reward and constraints directly at inference time. This approach does not involve fitting parameters to the evaluation benchmarks or renaming fitted quantities as predictions; the reported outperformance is measured on held-out reasoning tasks with explicit ablations and cross-benchmark checks. No load-bearing step reduces by construction to the inputs via self-definition, self-citation chains, or ansatz smuggling. The central claim therefore rests on independent empirical evidence rather than tautological equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5493 in / 1248 out tokens · 58670 ms · 2026-05-12T01:05:39.055654+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 8 internal anchors

  1. [1]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance.arXiv preprint arXiv:2305.05176, 2023

  2. [2]

    Routerdc: Query-based router by dual contrastive learning for assembling large language models.Advances in Neural Information Processing Systems, 37:66305–66328, 2024

    Shuhao Chen, Weisen Jiang, Baijiong Lin, James Kwok, and Yu Zhang. Routerdc: Query-based router by dual contrastive learning for assembling large language models.Advances in Neural Information Processing Systems, 37:66305–66328, 2024

  3. [3]

    Hybrid LLM: Cost-efficient and quality-aware query routing

    Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks VS Lakshmanan, and Ahmed Hassan Awadallah. Hybrid llm: Cost-efficient and quality- aware query routing.arXiv preprint arXiv:2404.14618, 2024

  4. [4]

    Aurélien Garivier and Eric Moulines

    Dujian Ding, Ankur Mallick, Shaokun Zhang, Chi Wang, Daniel Madrigal, Mirian Del Car- men Hipolito Garcia, Menglin Xia, Laks VS Lakshmanan, Qingyun Wu, and Victor R¨uhle. Best- route: Adaptive llm routing with test-time optimal compute.arXiv preprint arXiv:2506.22716, 2025

  5. [5]

    Acc-collab: An actor-critic approach to multi-agent llm collaboration.arXiv preprint arXiv:2411.00053, 2024

    Andrew Estornell, Jean-Francois Ton, Yuanshun Yao, and Yang Liu. Acc-collab: An actor-critic approach to multi-agent llm collaboration.arXiv preprint arXiv:2411.00053, 2024

  6. [6]

    Omni-math: A universal olympiad level mathematic benchmark for large language models

    Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models.arXiv preprint arXiv:2410.07985, 2024

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  8. [8]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021

  9. [9]

    Metagpt: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. InThe twelfth international conference on learning representations, 2023

  10. [10]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jia- jun Zhang, Bowen Yu, Keming Lu, et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186, 2024

  11. [11]

    arXiv preprint arXiv:2511.02755 , year =

    Bowen Jin, TJ Collins, Donghan Yu, Mert Cemri, Shenao Zhang, Mengyu Li, Jay Tang, Tian Qin, Zhiyang Xu, Jiarui Lu, et al. Controlling performance and budget of a centralized multi-agent llm system with reinforcement learning.arXiv preprint arXiv:2511.02755, 2025

  12. [12]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516, 2025

  13. [13]

    Solving quan- titative reasoning problems with language models.Advances in neural information processing systems, 2022

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quan- titative reasoning problems with language models.Advances in neural information processing systems, 2022

  14. [14]

    arXiv preprint arXiv:2410.02189 , year=

    Ao Li, Yuexiang Xie, Songze Li, Fugee Tsung, Bolin Ding, and Yaliang Li. Agent-oriented planning in multi-agent systems.arXiv preprint arXiv:2410.02189, 2024

  15. [15]

    Camel: Communicative agents for” mind” exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

    Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for” mind” exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023. 10

  16. [16]

    Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13(9):9, 2024

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13(9):9, 2024

  17. [17]

    A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges.Vicinagearth, 1(1):9, 2024

    Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang. A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges.Vicinagearth, 1(1):9, 2024

  18. [18]

    Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models

    Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025

  19. [19]

    Routing to the expert: Efficient reward-guided ensemble of large language models

    Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, and Jingren Zhou. Routing to the expert: Efficient reward-guided ensemble of large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1964–1974, 2024

  20. [20]

    Omnirouter: Budget and performance controllable multi-llm routing.ACM SIGKDD Explorations Newsletter, 27(2):107–116, 2025

    Kai Mei, Wujiang Xu, Minghao Guo, Shuhang Lin, and Yongfeng Zhang. Omnirouter: Budget and performance controllable multi-llm routing.ACM SIGKDD Explorations Newsletter, 27(2):107–116, 2025

  21. [21]

    Malt: Improving reasoning with multi-agent llm training.arXiv preprint arXiv:2412.01928, 2024

    Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Rafael Rafailov, Ivan Laptev, Philip HS Torr, Fabio Pizzati, Ronald Clark, and Christian Schroeder de Witt. Malt: Improving reasoning with multi-agent llm training.arXiv preprint arXiv:2412.01928, 2024

  22. [22]

    RouteLLM: Learning to Route LLMs with Preference Data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665, 2024

  23. [23]

    Online multi-llm selection via contextual bandits under unstructured context evolution

    Manhin Poon, XiangXiang Dai, Xutong Liu, Fang Kong, John CS Lui, and Jinhang Zuo. Online multi-llm selection via contextual bandits under unstructured context evolution. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 24855–24863, 2026

  24. [24]

    John Wiley & Sons, 2014

    Martin L Puterman.Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

  25. [25]

    Chatdev: Communicative agents for software development

    Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15174–15186, 2024

  26. [26]

    Navigating complexity: Orchestrated problem solving with multi-agent llms.arXiv preprint arXiv:2402.16713, 2024

    Sumedh Rasal and Evi J Hauer. Navigating complexity: Orchestrated problem solving with multi-agent llms.arXiv preprint arXiv:2402.16713, 2024

  27. [27]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational conference on machine learning, pages 1889–1897. PMLR, 2015

  28. [28]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  29. [29]

    Route-and-reason: Scaling large language model reasoning with reinforced model router.arXiv preprint arXiv:2506.05901, 2025

    Chenyang Shao, Xinyang Liu, Yutang Lin, Fengli Xu, and Yong Li. Route-and-reason: Scaling large language model reasoning with reinforced model router.arXiv preprint arXiv:2506.05901, 2025

  30. [30]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  31. [31]

    Hybridflow: A flexible and efficient rlhf framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025. 11

  32. [32]

    MIT press Cambridge, 1998

    Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction. MIT press Cambridge, 1998

  33. [33]

    Multi-agent collaboration: Harnessing the power of intelligent llm agents.arXiv preprint arXiv:2306.03314, 2023

    Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents.arXiv preprint arXiv:2306.03314, 2023

  34. [34]

    Autogen: Enabling next-gen llm applications via multi-agent conversations

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024

  35. [35]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  36. [36]

    Router-r1: Teaching llms multi-round routing and aggregation via reinforcement learning

    Haozhen Zhang, Tao Feng, and Jiaxuan You. Router-r1: Teaching llms multi-round routing and aggregation via reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  37. [37]

    TX t=1 γt−1R(st,a t)− TX t=1 KX k=1 ξk 1(f(a t) =k) # . (9) Under the reward definition in (1) andγ= 1, Problem (9) simplifies to: maximize θ Eq∼D Eat∼πθ(·|st)

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. InFindings of the association for computational linguistics: NAACL 2024, pages 2299–2314, 2024. 12 Appendix A Further Experiments 14 A.1 Other Controller Choices . . . . . . ...

  38. [38]

    Calculate the slope using the points (2, -5) and (p, -14): slope= −14−(−5) p−2 = −9 p−2

  39. [39]

    Calculate the slope using the points (2, -5) and (p+2, -17): slope= −17−(−5) (p+ 2)−2 = −12 p

  40. [40]

    Set the slopes equal to each other and solve forp: −9 p−2 = −12 p Cross-multiplying gives: −9p=−12(p−2) −9p=−12p+ 24 3p= 24 p= 8

  41. [41]

    Verify the equation of the line using the point (2, -5) and the slope− 3 2 : y+ 5 =− 3 2 (x−2) y=− 3 2 x+ 3−5 y=− 3 2 x−2

  42. [42]

    Find the value ofqwhenx= 13: q=− 3 2 (13)−2 =− 39 2 −2 =− 39 2 − 4 2 =− 43 2 =−21.5

  43. [43]

    Turn 1 (Controller): <thinking>This is a simple problem involving rationalizing the denominator

    Calculatep+q: p+q= 8 + (−21.5) =−13.5 </checking> <verdict>True</verdict> Example 2: Exit in 3 Turns Question:Rationalize the denominator: 1 2 √ 7 . Turn 1 (Controller): <thinking>This is a simple problem involving rationalizing the denominator. It is a basic algebraic manipulation problem that does not require advanced mathematical knowledge or complex p...

  44. [45]

    To rationalize the denominator, we multiply the numerator and the denominator by2 √ 7

  45. [46]

    This gives us 2 √ 7 2 √ 7·2 √ 7 = √ 7 2·7 = √ 7 14

  46. [47]

    We need to route the problem to a stronger agent to refine the solution

    The proposed solution seems to have an error in step 4. We need to route the problem to a stronger agent to refine the solution. Qwen2.5-7B-Instruct is good choice, which is stronger than Qwen2.5-1.5B-Instruct. </checking> <verdict>False</verdict> <model>Qwen2.5-7B-Instruct</model> Response from Qwen2.5-7B-Instruct:To rationalize the denominator of the ex...

  47. [48]

    The given expression is 1 2 √ 7

  48. [49]

    To rationalize the denominator, we multiply the numerator and the denominator by √ 7

  49. [50]

    This gives us √ 7 2 √ 7· √ 7 = √ 7 2·7 = √ 7 14

  50. [51]

    </checking> <verdict>True</verdict> 24

    The proposed solution is correct. </checking> <verdict>True</verdict> 24