arxiv: 2605.08686 · v1 · submitted 2026-05-09 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs

Wenzhi Fang , Liangqi Yuan , Guangchen Lan , Dong-Jun Han , Christopher G. Brinton

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:05 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent systemslarge language modelsiterative routingcritique-and-routingMarkov Decision Processpolicy gradientsheterogeneous agentsreasoning benchmarks

0 comments

The pith

The critique-and-routing controller models multi-agent LLM coordination as a finite-horizon MDP and optimizes it via policy gradients to enable iterative refinement that outperforms one-shot routing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that existing one-shot routing in multi-agent LLM systems can be improved by allowing a controller to critique intermediate drafts at each step, decide whether to stop, and route to the next suitable agent if refinement is needed. This sequential approach matters because it directly addresses the lack of mechanisms for iterative improvement and the tendency to overuse strong but costly models in heterogeneous pools. By casting the decisions as a finite-horizon Markov Decision Process with agent-utilization constraints and a composite reward, then optimizing the policy under Lagrangian relaxation, the controller learns cost-aware stopping and routing rules. Experiments across multiple systems and seven benchmarks support that the resulting policy beats baselines while keeping the strongest agent below 25 percent of calls.

Core claim

The central claim is that formulating the controller's per-turn choices—evaluate the current draft, stop or continue, and select the next agent—as a finite-horizon MDP with explicit utilization constraints, a composite reward across turns, and policy-gradient optimization under a Lagrangian-relaxed objective yields a policy that consistently outperforms one-shot baselines, narrows the gap to the strongest agent, and achieves this with under 25 percent usage of that agent across heterogeneous multi-agent setups and reasoning benchmarks.

What carries the argument

The critique-and-routing controller formulated as a finite-horizon Markov Decision Process with agent-utilization constraints and a composite reward, optimized by policy gradients under Lagrangian relaxation.

If this is right

The controller outperforms state-of-the-art one-shot routing baselines across seven reasoning benchmarks.
It substantially narrows the performance gap to the strongest agent in the pool.
The strongest agent is invoked for fewer than 25 percent of total calls.
The approach works across multiple heterogeneous multi-agent systems.
Sequential decisions support iterative refinement of intermediate drafts rather than single-shot outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The sequential MDP framing could be extended to other orchestration tasks where intermediate quality checks matter, such as code generation or multi-step planning.
Explicit constraints on agent usage during training may prove more effective than post-hoc filtering for controlling inference costs in production deployments.
The composite reward structure points to a general way to encode quality-cost tradeoffs that other LLM routing methods could adopt.
Testing generalization when the agent pool changes dynamically would clarify how robust the learned policy remains outside fixed training conditions.

Load-bearing premise

The MDP formulation, composite reward, and Lagrangian-relaxed policy-gradient optimization produce decisions that generalize across benchmarks and agent pools without bias introduced by the reward design or constraint handling.

What would settle it

A new experiment on an unseen reasoning benchmark or different LLM pool in which the controller shows no outperformance over one-shot baselines or requires the strongest agent in more than 25 percent of calls would falsify the performance claims.

Figures

Figures reproduced from arXiv: 2605.08686 by Christopher G. Brinton, Dong-Jun Han, Guangchen Lan, Liangqi Yuan, Wenzhi Fang.

**Figure 2.** Figure 2: Multi-turn rollout tree for GRPO training of the critique-and-routing controller. From [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Training dynamics for two multi-agent systems with different controller and agent-pool configurations. System 1 uses Qwen2.5-7B-Base as the controller with agents Qwen3-30B-A3BInstruct (Agent 1), Qwen2.5-7B-Instruct (Agent 2), and Qwen2.5-1.5B-Instruct. System 2 uses Qwen3-4B-Base as the controller with agents Qwen3-30B-A3B-Instruct (Agent 1), Ministral-3-8BInstruct (Agent 2), and Llama-3-2-1B-Instruct. … view at source ↗

**Figure 4.** Figure 4: Accuracy under different maximum turn budgets 1 2 3 Maximum Number of Turns 0.00 0.25 0.50 0.75 1.00 Fraction Exhausted System 1 System 2 [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 6.** Figure 6: Sensitivity of our method to different Lagrangian multipliers. We vary the multiplier [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

read the original abstract

Multi-agent large language model (LLM) systems often rely on a controller to coordinate a pool of heterogeneous models, yet existing controllers are typically limited to one-shot routing: they select a model once and return its output directly. Such routing-only designs provide no mechanism to critique intermediate drafts or support iterative refinement. To address this limitation, we propose a critique-and-routing controller that casts multi-agent coordination as a sequential decision problem. At each turn, the controller evaluates the current draft, decides whether to stop or continue, and, if needed, selects the next agent for further refinement. We formulate this process as a finite-horizon Markov Decision Process (MDP) with explicit agent-utilization constraints, design a composite reward for controller decisions across turns, and optimize the controller via policy gradients under a Lagrangian-relaxed objective. Extensive experiments across multiple heterogeneous multi-agent systems and seven reasoning benchmarks show that our method consistently outperforms state-of-the-art baselines and substantially narrows the gap to the strongest agent, while using it for fewer than 25% of total calls.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes an iterative critique-and-routing controller for multi-agent systems with heterogeneous LLMs. It formulates coordination as a finite-horizon MDP with explicit agent-utilization constraints, designs a composite reward function, and optimizes the controller via policy gradients under Lagrangian relaxation. Experiments across multiple heterogeneous multi-agent setups and seven reasoning benchmarks show consistent outperformance over state-of-the-art baselines, narrowing the gap to the strongest agent while invoking it for fewer than 25% of total calls.

Significance. If the empirical results hold, the work offers a practical mechanism for iterative refinement and efficient routing in LLM multi-agent systems, reducing reliance on the strongest (often costliest) models. A notable strength is the prompt-based implementation with zero additional parameters, where the LLM approximates the policy and composite rewards/constraints are enforced at inference time rather than via gradient updates; this sidesteps training instability. The inclusion of reward-component ablations and cross-benchmark generalization checks further supports the claims.

major comments (1)

[Abstract and Methods] Abstract and Methods: The abstract claims the controller is optimized 'via policy gradients under a Lagrangian-relaxed objective,' yet the methods describe a zero-parameter prompt-based controller that approximates the policy directly with the LLM and enforces the composite reward plus constraints at inference time without any gradient-based parameter updates. This mismatch is load-bearing for the central methodological claim and requires explicit reconciliation (e.g., clarifying that the MDP is a conceptual model rather than a trained policy).

minor comments (2)

[§4] §4 (Experiments): The description of baseline implementations could be expanded to include exact prompt templates and hyperparameter settings used for fair comparison.
[Table 1 and Figure 3] Table 1 and Figure 3: Axis labels and legend entries should explicitly state the metric (e.g., accuracy vs. cost) and the exact fraction of calls to the strongest agent to improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The feedback highlights an important point of potential confusion in how the methodological contribution is presented. We address the major comment below and will revise the manuscript to resolve the inconsistency.

read point-by-point responses

Referee: [Abstract and Methods] Abstract and Methods: The abstract claims the controller is optimized 'via policy gradients under a Lagrangian-relaxed objective,' yet the methods describe a zero-parameter prompt-based controller that approximates the policy directly with the LLM and enforces the composite reward plus constraints at inference time without any gradient-based parameter updates. This mismatch is load-bearing for the central methodological claim and requires explicit reconciliation (e.g., clarifying that the MDP is a conceptual model rather than a trained policy).

Authors: We agree that the current wording in the abstract creates a misleading impression. The finite-horizon MDP formulation is used strictly as a conceptual model to derive the structure of the composite reward function and the explicit agent-utilization constraints. In the actual implementation, the controller is realized as a zero-parameter, prompt-based system: the LLM directly approximates the policy by generating decisions at inference time, while the reward components and constraints are enforced through carefully designed prompts rather than through any gradient-based optimization or Lagrangian relaxation during training. No policy-gradient updates or parameter optimization occur. We will revise the abstract (and the opening of the Methods section) to explicitly state that the MDP serves as a design framework rather than a trained policy, and that the controller is implemented via prompting without gradient-based training. This change will be incorporated in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper formulates multi-agent coordination as a finite-horizon MDP with agent-utilization constraints and a composite reward, then implements the controller as a prompt-based LLM policy (zero trainable parameters) that enforces the reward and constraints directly at inference time. This approach does not involve fitting parameters to the evaluation benchmarks or renaming fitted quantities as predictions; the reported outperformance is measured on held-out reasoning tasks with explicit ablations and cross-benchmark checks. No load-bearing step reduces by construction to the inputs via self-definition, self-citation chains, or ansatz smuggling. The central claim therefore rests on independent empirical evidence rather than tautological equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5493 in / 1248 out tokens · 58670 ms · 2026-05-12T01:05:39.055654+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We formulate this process as a finite-horizon Markov Decision Process (MDP) with explicit agent-utilization constraints, design a composite reward for controller decisions across turns, and optimize the controller via policy gradients under a Lagrangian-relaxed objective.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

rt := R(st,at) = 0.5 Rr(st,at) + 0.5 Rv(st,at)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 8 internal anchors

[1]

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance.arXiv preprint arXiv:2305.05176, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Routerdc: Query-based router by dual contrastive learning for assembling large language models.Advances in Neural Information Processing Systems, 37:66305–66328, 2024

Shuhao Chen, Weisen Jiang, Baijiong Lin, James Kwok, and Yu Zhang. Routerdc: Query-based router by dual contrastive learning for assembling large language models.Advances in Neural Information Processing Systems, 37:66305–66328, 2024

work page 2024
[3]

Hybrid LLM: Cost-efficient and quality-aware query routing

Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks VS Lakshmanan, and Ahmed Hassan Awadallah. Hybrid llm: Cost-efficient and quality- aware query routing.arXiv preprint arXiv:2404.14618, 2024

work page arXiv 2024
[4]

Aurélien Garivier and Eric Moulines

Dujian Ding, Ankur Mallick, Shaokun Zhang, Chi Wang, Daniel Madrigal, Mirian Del Car- men Hipolito Garcia, Menglin Xia, Laks VS Lakshmanan, Qingyun Wu, and Victor R¨uhle. Best- route: Adaptive llm routing with test-time optimal compute.arXiv preprint arXiv:2506.22716, 2025

work page arXiv 2025
[5]

Acc-collab: An actor-critic approach to multi-agent llm collaboration.arXiv preprint arXiv:2411.00053, 2024

Andrew Estornell, Jean-Francois Ton, Yuanshun Yao, and Yang Liu. Acc-collab: An actor-critic approach to multi-agent llm collaboration.arXiv preprint arXiv:2411.00053, 2024

work page arXiv 2024
[6]

Omni-math: A universal olympiad level mathematic benchmark for large language models

Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models.arXiv preprint arXiv:2410.07985, 2024

work page arXiv 2024
[7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021

work page 2021
[9]

Metagpt: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. InThe twelfth international conference on learning representations, 2023

work page 2023
[10]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jia- jun Zhang, Bowen Yu, Keming Lu, et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

arXiv preprint arXiv:2511.02755 , year =

Bowen Jin, TJ Collins, Donghan Yu, Mert Cemri, Shenao Zhang, Mengyu Li, Jay Tang, Tian Qin, Zhiyang Xu, Jiarui Lu, et al. Controlling performance and budget of a centralized multi-agent llm system with reinforcement learning.arXiv preprint arXiv:2511.02755, 2025

work page arXiv 2025
[12]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Solving quan- titative reasoning problems with language models.Advances in neural information processing systems, 2022

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quan- titative reasoning problems with language models.Advances in neural information processing systems, 2022

work page 2022
[14]

arXiv preprint arXiv:2410.02189 , year=

Ao Li, Yuexiang Xie, Songze Li, Fugee Tsung, Bolin Ding, and Yaliang Li. Agent-oriented planning in multi-agent systems.arXiv preprint arXiv:2410.02189, 2024

work page arXiv 2024
[15]

Camel: Communicative agents for” mind” exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for” mind” exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023. 10

work page 2023
[16]

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13(9):9, 2024

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13(9):9, 2024

work page 2024
[17]

A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges.Vicinagearth, 1(1):9, 2024

Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang. A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges.Vicinagearth, 1(1):9, 2024

work page 2024
[18]

Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models

Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025

work page arXiv 2025
[19]

Routing to the expert: Efficient reward-guided ensemble of large language models

Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, and Jingren Zhou. Routing to the expert: Efficient reward-guided ensemble of large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1964–1974, 2024

work page 2024
[20]

Omnirouter: Budget and performance controllable multi-llm routing.ACM SIGKDD Explorations Newsletter, 27(2):107–116, 2025

Kai Mei, Wujiang Xu, Minghao Guo, Shuhang Lin, and Yongfeng Zhang. Omnirouter: Budget and performance controllable multi-llm routing.ACM SIGKDD Explorations Newsletter, 27(2):107–116, 2025

work page 2025
[21]

Malt: Improving reasoning with multi-agent llm training.arXiv preprint arXiv:2412.01928, 2024

Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Rafael Rafailov, Ivan Laptev, Philip HS Torr, Fabio Pizzati, Ronald Clark, and Christian Schroeder de Witt. Malt: Improving reasoning with multi-agent llm training.arXiv preprint arXiv:2412.01928, 2024

work page arXiv 2024
[22]

RouteLLM: Learning to Route LLMs with Preference Data

Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Online multi-llm selection via contextual bandits under unstructured context evolution

Manhin Poon, XiangXiang Dai, Xutong Liu, Fang Kong, John CS Lui, and Jinhang Zuo. Online multi-llm selection via contextual bandits under unstructured context evolution. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 24855–24863, 2026

work page 2026
[24]

John Wiley & Sons, 2014

Martin L Puterman.Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

work page 2014
[25]

Chatdev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15174–15186, 2024

work page 2024
[26]

Navigating complexity: Orchestrated problem solving with multi-agent llms.arXiv preprint arXiv:2402.16713, 2024

Sumedh Rasal and Evi J Hauer. Navigating complexity: Orchestrated problem solving with multi-agent llms.arXiv preprint arXiv:2402.16713, 2024

work page arXiv 2024
[27]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational conference on machine learning, pages 1889–1897. PMLR, 2015

work page 2015
[28]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[29]

Route-and-reason: Scaling large language model reasoning with reinforced model router.arXiv preprint arXiv:2506.05901, 2025

Chenyang Shao, Xinyang Liu, Yutang Lin, Fengli Xu, and Yong Li. Route-and-reason: Scaling large language model reasoning with reinforced model router.arXiv preprint arXiv:2506.05901, 2025

work page arXiv 2025
[30]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025. 11

work page 2025
[32]

MIT press Cambridge, 1998

Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction. MIT press Cambridge, 1998

work page 1998
[33]

Multi-agent collaboration: Harnessing the power of intelligent llm agents.arXiv preprint arXiv:2306.03314, 2023

Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents.arXiv preprint arXiv:2306.03314, 2023

work page arXiv 2023
[34]

Autogen: Enabling next-gen llm applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024

work page 2024
[35]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Router-r1: Teaching llms multi-round routing and aggregation via reinforcement learning

Haozhen Zhang, Tao Feng, and Jiaxuan You. Router-r1: Teaching llms multi-round routing and aggregation via reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[37]

TX t=1 γt−1R(st,a t)− TX t=1 KX k=1 ξk 1(f(a t) =k) # . (9) Under the reward definition in (1) andγ= 1, Problem (9) simplifies to: maximize θ Eq∼D Eat∼πθ(·|st)

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. InFindings of the association for computational linguistics: NAACL 2024, pages 2299–2314, 2024. 12 Appendix A Further Experiments 14 A.1 Other Controller Choices . . . . . . ...

work page 2024
[38]

Calculate the slope using the points (2, -5) and (p, -14): slope= −14−(−5) p−2 = −9 p−2

work page
[39]

Calculate the slope using the points (2, -5) and (p+2, -17): slope= −17−(−5) (p+ 2)−2 = −12 p

work page
[40]

Set the slopes equal to each other and solve forp: −9 p−2 = −12 p Cross-multiplying gives: −9p=−12(p−2) −9p=−12p+ 24 3p= 24 p= 8

work page
[41]

Verify the equation of the line using the point (2, -5) and the slope− 3 2 : y+ 5 =− 3 2 (x−2) y=− 3 2 x+ 3−5 y=− 3 2 x−2

work page
[42]

Find the value ofqwhenx= 13: q=− 3 2 (13)−2 =− 39 2 −2 =− 39 2 − 4 2 =− 43 2 =−21.5

work page
[43]

Turn 1 (Controller): <thinking>This is a simple problem involving rationalizing the denominator

Calculatep+q: p+q= 8 + (−21.5) =−13.5 </checking> <verdict>True</verdict> Example 2: Exit in 3 Turns Question:Rationalize the denominator: 1 2 √ 7 . Turn 1 (Controller): <thinking>This is a simple problem involving rationalizing the denominator. It is a basic algebraic manipulation problem that does not require advanced mathematical knowledge or complex p...

work page
[45]

To rationalize the denominator, we multiply the numerator and the denominator by2 √ 7

work page
[46]

This gives us 2 √ 7 2 √ 7·2 √ 7 = √ 7 2·7 = √ 7 14

work page
[47]

We need to route the problem to a stronger agent to refine the solution

The proposed solution seems to have an error in step 4. We need to route the problem to a stronger agent to refine the solution. Qwen2.5-7B-Instruct is good choice, which is stronger than Qwen2.5-1.5B-Instruct. </checking> <verdict>False</verdict> <model>Qwen2.5-7B-Instruct</model> Response from Qwen2.5-7B-Instruct:To rationalize the denominator of the ex...

work page
[48]

The given expression is 1 2 √ 7

work page
[49]

To rationalize the denominator, we multiply the numerator and the denominator by √ 7

work page
[50]

This gives us √ 7 2 √ 7· √ 7 = √ 7 2·7 = √ 7 14

work page
[51]

</checking> <verdict>True</verdict> 24

The proposed solution is correct. </checking> <verdict>True</verdict> 24

work page