arxiv: 2605.05007 · v1 · submitted 2026-05-06 · 💻 cs.AI

Recognition: 3 theorem links

· Lean Theorem

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

Cheng Yang, Hanqing Wang, Haotong Xie, Jiahao Yuan, Qibing Ren, Siru Zhong, Siyu Zhang, Tao Yu, Usman Naseem, Xinlei Yu, Yifan Wu, Yifu Guo, Yuxin Wu, Zhiqing Cui

Pith reviewed 2026-05-08 18:00 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent LLM systemsselective delegationorchestration policyreinforcement learningtask decompositionagent routingworkflow optimizationcost efficiency

0 comments

The pith

A learned policy for selective task decomposition and model routing in LLM agents reaches 77% accuracy while cutting per-query cost by an order of magnitude.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current multi-agent LLM systems lock into either flat single-model routing or fixed manual decompositions, so they cannot jointly tune how deep to split a task, which worker to assign, and how much inference budget to spend. Uno-Orchestra replaces those rigid choices with one policy trained by reinforcement learning on trajectories drawn from actual worker runs; the policy decides both whether to decompose and which admissible model-primitive pair to use for each piece. A sympathetic reader would care because this unification removes the need for separate hand-engineering steps and directly trades off accuracy against cost under a single objective. If the policy generalizes, agent systems could handle mixed workloads more reliably without the usual steep rise in compute.

Core claim

Uno-Orchestra is a unified orchestration policy that selectively decomposes a task and dispatches each subtask to an admissible (model, primitive) pair, with both decisions learned together from curated RL trajectories grounded in real worker interactions. Against 22 baselines on a 13-benchmark suite spanning math, code, knowledge, long-context, and agentic tool-use, Uno-Orchestra reaches 77.0% macro pass@1, roughly 16% above the strongest workflow baseline, at roughly an order of magnitude lower per-query cost.

What carries the argument

The central mechanism is a single learned policy that jointly selects decomposition depth and worker assignment by choosing from admissible (model, primitive) pairs, trained end-to-end on interaction trajectories rather than through separate hand-crafted rules.

If this is right

Decomposition depth and model choice become dynamic per query instead of fixed in advance.
The system can reserve deeper or stronger workers only for tasks where they measurably improve outcomes.
Overall compute per query drops because unnecessary subtasks and overpowered models are skipped.
The same policy applies without redesign across math, code, knowledge, long-context, and tool-use domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If interaction data can be collected at scale, many existing agent frameworks could replace their static routers with a learned one trained on their own logs.
The approach suggests that data-driven routing may matter more for efficiency than simply scaling model size or adding more agents.
One could test whether the policy remains effective when the set of available models changes over time, by retraining only on new trajectories.
Extending the method to online adaptation might let the system refine its routing decisions as it encounters novel task distributions.

Load-bearing premise

The curated RL trajectories from past worker interactions are representative of future tasks and the policy will generalize without introducing systematic decomposition errors or hidden costs.

What would settle it

Evaluating the same policy on a new suite of tasks outside the original 13 benchmarks that require decomposition patterns or tool combinations absent from the training trajectories, and measuring whether the accuracy and cost advantages persist or collapse.

Figures

Figures reproduced from arXiv: 2605.05007 by Cheng Yang, Hanqing Wang, Haotong Xie, Jiahao Yuan, Qibing Ren, Siru Zhong, Siyu Zhang, Tao Yu, Usman Naseem, Xinlei Yu, Yifan Wu, Yifu Guo, Yuxin Wu, Zhiqing Cui.

**Figure 1.** Figure 1: LLM orchestration paradigms: (A) model router, (B) hierarchical orchestra, (C) UNOORCHESTRA (ours). In practice, however, both threads degenerate into limited patterns. (1) Single-call routers commit at query granularity: they pick one expert per query, miss the parallel structure of complex tasks, and the few budgeted variants [45, 16] live in a flat single-call action space that cannot trade decompositi… view at source ↗

**Figure 2.** Figure 2: Overview of UNO-ORCHESTRA. (A) Multi-turn selective delegation: at each turn the orchestrator decides when, where, and how to decompose the task, configures one (model, primitive) routing pair per subtask, and dispatches the subtasks to heterogeneous workers; observations feed back as conditioning context for the next turn. (B) Two-stage training: Stage 1 (SFT) distils teacher trajectories grounded in real… view at source ↗

**Figure 3.** Figure 3: Accuracy and efficiency overview on the 13-benchmark suite. Panel (a) plots macro pass@1 view at source ↗

**Figure 4.** Figure 4: Performance under three train-test distribution-shift regimes: (a) in-domain, (b) near view at source ↗

**Figure 5.** Figure 5: Router-backbone comparison: pass@1 (dashed) and pass@2 (solid) across five capability domains. Router-backbone size. The router itself is a Qwen2.5-7B-Instruct in the main configuration view at source ↗

**Figure 6.** Figure 6: Token-stream schematics for the four trajectory behaviour modes of view at source ↗

read the original abstract

Large language model (LLM) multi-agent systems typically rely on rigid orchestration, committing either to flat per-query routing or to hand-engineered task decomposition, so decomposition depth, worker choice, and inference budget are not jointly optimized under one objective. We introduce Uno-Orchestra, a unified orchestration policy that selectively decomposes a task and dispatches each subtask to an admissible (model, primitive) pair, with both decisions learned together from curated RL trajectories grounded in real worker interactions. Against 22 baselines on a 13-benchmark suite spanning math, code, knowledge, long-context, and agentic tool-use, Uno-Orchestra reaches 77.0% macro pass@1, roughly 16% above the strongest workflow baseline, at roughly an order of magnitude lower per-query cost, advancing the accuracy-efficiency frontier of selective delegation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Uno-Orchestra, a unified orchestration policy for LLM multi-agent systems that jointly learns selective task decomposition and dispatch to admissible (model, primitive) pairs from curated RL trajectories grounded in real worker interactions. It evaluates the approach against 22 baselines across a 13-benchmark suite covering math, code, knowledge, long-context, and agentic tool-use tasks, reporting 77.0% macro pass@1 (roughly 16% above the strongest workflow baseline) at an order of magnitude lower per-query cost.

Significance. If the performance and cost claims hold under rigorous validation, the work would meaningfully advance the accuracy-efficiency frontier for selective delegation in multi-agent LLM systems by replacing rigid or hand-engineered orchestration with a single learned policy. The grounding in real interaction trajectories and joint optimization of decomposition depth and routing are notable strengths that could influence future agent routing designs.

major comments (3)

[§4] §4 (Experiments) and Table 2: the headline 77.0% macro pass@1 and +16% lift over the strongest baseline are presented without reported statistical tests, variance across runs, or explicit definitions of the 22 baselines and their configurations; this makes it impossible to assess whether the gains are robust or attributable to the learned policy rather than implementation details.
[§3.2] §3.2 (RL Training) and §5 (Ablations): the claim that the policy generalizes from curated trajectories to the full 13-benchmark diversity (math, code, long-context, tool-use) lacks evidence on trajectory coverage, decomposition pattern distribution, or out-of-distribution performance; if the training distribution is narrower, the reported cost reduction and accuracy lift could be overstated.
[Table 3] Table 3 (Cost Analysis): the order-of-magnitude per-query cost reduction is reported as a key result, but without breakdown of hidden orchestration overhead, token usage for decomposition decisions, or sensitivity to model choice, the efficiency claim cannot be fully evaluated against the accuracy gains.

minor comments (2)

[Abstract] The abstract and introduction use 'macro pass@1' without an early definition or reference to how it aggregates across the heterogeneous benchmarks.
[Figure 1] Figure 1 (system diagram) would benefit from explicit annotation of the learned policy components (decomposition decision and dispatch) to match the textual description in §2.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for improving the clarity and rigor of our experimental claims. We address each major comment below and commit to revisions that will strengthen the manuscript without misrepresenting our results.

read point-by-point responses

Referee: [§4] §4 (Experiments) and Table 2: the headline 77.0% macro pass@1 and +16% lift over the strongest baseline are presented without reported statistical tests, variance across runs, or explicit definitions of the 22 baselines and their configurations; this makes it impossible to assess whether the gains are robust or attributable to the learned policy rather than implementation details.

Authors: We agree that statistical tests, variance reporting, and explicit baseline definitions are necessary for assessing robustness. In the revised manuscript, we will add standard deviations computed over multiple independent runs (minimum 5 seeds), include paired statistical significance tests (e.g., Wilcoxon signed-rank) for the reported lifts, and expand §4 plus the appendix with precise configurations, hyperparameters, and implementation details for all 22 baselines. These additions will make it possible to attribute gains more clearly to the learned policy. revision: yes
Referee: [§3.2] §3.2 (RL Training) and §5 (Ablations): the claim that the policy generalizes from curated trajectories to the full 13-benchmark diversity (math, code, long-context, tool-use) lacks evidence on trajectory coverage, decomposition pattern distribution, or out-of-distribution performance; if the training distribution is narrower, the reported cost reduction and accuracy lift could be overstated.

Authors: The trajectories were curated from real worker interactions spanning the benchmark categories, but we acknowledge the need for explicit coverage analysis. We will revise §3.2 to include quantitative statistics on trajectory task-type distribution and decomposition-depth patterns, and extend §5 with additional held-out task evaluations to probe out-of-distribution behavior. This will either further support the generalization claims or allow us to qualify them appropriately. revision: yes
Referee: [Table 3] Table 3 (Cost Analysis): the order-of-magnitude per-query cost reduction is reported as a key result, but without breakdown of hidden orchestration overhead, token usage for decomposition decisions, or sensitivity to model choice, the efficiency claim cannot be fully evaluated against the accuracy gains.

Authors: We concur that a finer-grained cost breakdown is required to substantiate the efficiency results. In the revision we will expand Table 3 and its discussion to report orchestration overhead (policy inference tokens), token counts specifically for decomposition decisions, sensitivity sweeps over orchestrator model choices, and per-benchmark cost-accuracy pairs. These details will enable a more transparent comparison of net gains. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL policy evaluated on independent benchmarks

full rationale

The paper presents an RL-trained orchestration policy whose decisions are learned from curated trajectories of real worker interactions and then evaluated on a 13-benchmark suite. No equations, fitted parameters, or self-citations are shown that would make the reported 77% pass@1 or cost reduction equivalent to the training data by construction. The derivation chain is therefore self-contained: training data and test benchmarks remain distinct, and performance gains are measured rather than defined.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit mathematical assumptions, free parameters, or invented entities; the approach is described purely at the level of learned policy and empirical results.

pith-pipeline@v0.9.0 · 5484 in / 1153 out tokens · 81761 ms · 2026-05-08T18:00:31.013748+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation/AlphaCoordinateFixation, Cost/FunctionalEquation (J-cost) J_uniquely_calibrated_via_higher_derivative unclear
PPO clipping ϵ=0.2, KL against a frozen router reference β=10⁻³, terminal cost blending α=0.1 ... at most T_max=8 orchestrator turns

Reference graph

Works this paper leans on

84 extracted references · 40 canonical work pages · 17 internal anchors

[1]

The orchestration of multi-agent systems: Architectures, protocols, and enterprise adoption

Apoorva Adimulam, Rajesh Gupta, and Sumit Kumar. The orchestration of multi-agent systems: Architectures, protocols, and enterprise adoption.arXiv preprint arXiv:2601.13671, 2026

work page arXiv 2026
[2]

Automix: Automatically mixing language models

Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, et al. Automix: Automatically mixing language models.arXiv preprint arXiv:2310.12963, 2023

work page arXiv 2023
[3]

Ai-mo/aimo-validation-aime

AI-MO. Ai-mo/aimo-validation-aime. https://huggingface.co/datasets/AI-MO/ aimo-validation-aime, 2024

2024
[4]

Claude code: Create custom subagents

Anthropic. Claude code: Create custom subagents. Anthropic Documentation, 2026. URL https://docs.anthropic.com/en/docs/claude-code/sub-agents

2026
[5]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page Pith review arXiv 2021
[6]

Semantic parsing on freebase from question-answer pairs

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. InProceedings of the 2013 conference on empirical methods in natural language processing, pages 1533–1544, 2013

2013
[7]

Piqa: Reasoning about phys- ical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about phys- ical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

2020
[8]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review arXiv 2021
[9]

Routerdc: Query-based router by dual contrastive learning for assembling large language models.Advances in Neural Information Processing Systems, 37:66305–66328, 2024

Shuhao Chen, Weisen Jiang, Baijiong Lin, James Kwok, and Yu Zhang. Routerdc: Query-based router by dual contrastive learning for assembling large language models.Advances in Neural Information Processing Systems, 37:66305–66328, 2024

2024
[10]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page Pith review arXiv 2018
[11]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review arXiv 2021
[12]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page Pith review arXiv 2025
[13]

Multi-agent collaboration via evolving orchestration, arXiv preprint arXiv:2505.19591, 2025

Yufan Dang, Chen Qian, Xueheng Luo, Jingru Fan, Zihao Xie, Ruijie Shi, Weize Chen, Cheng Yang, Xiaoyin Che, Ye Tian, et al. Multi-agent collaboration via evolving orchestration.arXiv preprint arXiv:2505.19591, 2025. 10

work page arXiv 2025
[14]

Treegrpo: Tree-advantage grpo for online rl post-training of diffusion models.arXiv preprint arXiv:2512.08153, 2025

Zheng Ding and Weirui Ye. Treegrpo: Tree-advantage grpo for online rl post-training of diffusion models.arXiv preprint arXiv:2512.08153, 2025

work page arXiv 2025
[15]

Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gard- ner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Paper...

2019
[16]

GraphRAG-Router: Learning Cost-Efficient Routing over GraphRAGs and LLMs with Reinforcement Learning

Dongzhe Fan, Chuanhao Ji, Zimu Wang, Tong Chen, and Qiaoyu Tan. Graphrag-router: Learning cost-efficient routing over graphrags and llms with reinforcement learning.arXiv preprint arXiv:2604.16401, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[17]

A comprehensive survey of self-evolving ai agents: A new paradigm bridging foundation models and lifelong agentic systems.arXiv preprint arXiv:2508.07407, 2025

Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, et al. A comprehensive survey of self-evolving ai agents: A new paradigm bridging foundation models and lifelong agentic systems.arXiv preprint arXiv:2508.07407, 2025

work page arXiv 2025
[18]

Group-in-Group Policy Optimization for LLM Agent Training

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978, 2025

work page internal anchor Pith review arXiv 2025
[19]

A., Tihanyi, N., and Debbah, M

Mohamed Amine Ferrag, Norbert Tihanyi, and Merouane Debbah. From llm reasoning to autonomous ai agents: A comprehensive review.arXiv preprint arXiv:2504.19678, 2025

work page arXiv 2025
[20]

Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021

2021
[21]

Gemini 3 Pro model card

Google DeepMind. Gemini 3 Pro model card. https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf, 2025

2025
[22]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review arXiv 2025
[23]

Folio: Natural language reasoning with first-order logic

Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, et al. Folio: Natural language reasoning with first-order logic. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22017–22031, 2024

2024
[24]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review arXiv 2009
[25]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review arXiv 2021
[26]

Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. InProceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020

2020
[27]

open-r1/codeforces-cots

Hugging Face Open-R1. open-r1/codeforces-cots. https://huggingface.co/datasets/ open-r1/codeforces-cots, 2025

2025
[28]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review arXiv 2024
[29]

Tree search for llm agent reinforcement learning, 2026

Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, and Liaoni Wu. Tree search for llm agent reinforcement learning.arXiv preprint arXiv:2509.21240, 2025

work page arXiv 2025
[30]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023. 11

work page internal anchor Pith review arXiv 2023
[31]

Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017

2017
[32]

Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

2019
[33]

LLMRouterBench: A Massive Benchmark and Unified Framework for LLM Routing

Hao Li, Yiqun Zhang, Zhaoyan Guo, Chenxu Wang, Shengji Tang, Qiaosheng Zhang, Yang Chen, Biqing Qi, Peng Ye, Lei Bai, et al. Llmrouterbench: A massive benchmark and unified framework for llm routing.arXiv preprint arXiv:2601.07206, 2026

work page arXiv 2026
[34]

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13(9):9, 2024

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13(9):9, 2024

2024
[35]

TACO: Topics in algorithmic code generation dataset.arXiv preprint arXiv:2312.14852, 2023

Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. Taco: Topics in algorithmic code generation dataset.arXiv preprint arXiv:2312.14852, 2023

work page arXiv 2023
[36]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

2023
[37]

Logiqa 2.0–an improved dataset for logical reasoning in natural language understanding

Hanmeng Liu, Jian Liu, Leyang Cui, Zhiyang Teng, Nan Duan, Ming Zhou, and Yue Zhang. Logiqa 2.0–an improved dataset for logical reasoning in natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2947–2962, 2023

2023
[38]

Task-Aware LLM Routing with Multi-Level Task-Profile-Guided Data Synthesis for Cold-Start Scenarios

Hui Liu, Bin Zou, Kecheng Chen, Jie Liu, Wenya Wang, and Haoliang Li. Task-aware llm routing with multi-level task-profile-guided data synthesis for cold-start scenarios.arXiv preprint arXiv:2604.09377, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

arXiv preprint arXiv:2409.00920

Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, et al. Toolace: Winning the points of llm function calling.arXiv preprint arXiv:2409.00920, 2024

work page arXiv 2024
[40]

Gaia: a benchmark for general ai assistants

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

2023
[41]

Routing with generated data: Annotation-free llm skill estimation and expert selection.arXiv preprint arXiv:2601.09692, 2026

Tianyi Niu, Justin Chih-Yao Chen, Genta Indra Winata, Shi-Xiong Zhang, Supriyo Chakraborty, Sambit Sahu, Yue Zhang, Elias Stengel-Eskin, and Mohit Bansal. Routing with generated data: Annotation-free llm skill estimation and expert selection.arXiv preprint arXiv:2601.09692, 2026

work page arXiv 2026
[42]

Explainable Model Routing for Agentic Workflows

Mika Okamoto, Ansel Kaplan Erol, and Mark Riedl. Explainable model routing for agentic workflows.arXiv preprint arXiv:2604.03527, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[43]

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. InAdvances in Neural Information Processing Systems, 2022

2022
[44]

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, et al. Quality: Question answering with long input texts, yes! InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ...

2022
[45]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu

Cheng Qian, Zuxin Liu, Shirley Kokane, Akshara Prabhakar, Jielin Qiu, Haolin Chen, Zhi- wei Liu, Heng Ji, Weiran Yao, Shelby Heinecke, et al. xrouter: Training cost-aware llms orchestration system via reinforcement learning.arXiv preprint arXiv:2510.08439, 2025

work page arXiv 2025
[46]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis.arXiv preprint arXiv:2307.16789, 2023

work page internal anchor Pith review arXiv 2023
[47]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems, 2023

2023
[48]

Zero: Memory optimiza- tions toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimiza- tions toward training trillion parameter models. InSC20: international conference for high performance computing, networking, storage and analysis, pages 1–16. IEEE, 2020

2020
[49]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022, 2023

work page internal anchor Pith review arXiv 2023
[50]

Aorchestra: Automating sub-agent creation for agentic orchestration.arXiv preprint arXiv:2602.03786, 2026

Jianhao Ruan, Zhihao Xu, Yiran Peng, Fashen Ren, Zhaoyang Yu, Xinbing Liang, Jinyu Xiang, Yongru Chen, Bang Liu, Chenglin Wu, et al. Aorchestra: Automating sub-agent creation for agentic orchestration.arXiv preprint arXiv:2602.03786, 2026

work page arXiv 2026
[51]

Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

2021
[52]

Social iqa: Commonsense reasoning about social interactions

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social iqa: Commonsense reasoning about social interactions. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 4463–4473, 2019

2019
[53]

Thread: Thinking deeper with recursive spawning

Philip Schroeder, Nathaniel W Morgan, Hongyin Luo, and James Glass. Thread: Thinking deeper with recursive spawning. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8418–8442, 2025

2025
[54]

Proximal policy optimization algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. 2017

2017
[55]

Route-and-reason: Scaling large language model reasoning with reinforced model router.arXiv preprint arXiv:2506.05901, 2025

Chenyang Shao, Xinyang Liu, Yutang Lin, Fengli Xu, and Yong Li. Route-and-reason: Scaling large language model reasoning with reinforced model router.arXiv preprint arXiv:2506.05901, 2025

work page arXiv 2025
[56]

Route-and-reason: Energy- efficient scaling of llm reasoning via reinforced model routing

Chenyang Shao, Xinyang Liu, Yutang Lin, Fengli Xu, and Yong Li. Route-and-reason: Energy- efficient scaling of llm reasoning via reinforced model routing. InProceedings of the ACM Web Conference 2026, pages 9551–9562, 2026

2026
[57]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review arXiv 2024
[58]

arXiv preprint arXiv:2511.21689 , year=

Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, et al. Toolorchestra: Elevating intelligence via efficient model and tool orchestration.arXiv preprint arXiv:2511.21689, 2025

work page arXiv 2025
[59]

Challenging big-bench tasks and whether chain-of-thought can solve them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. InFindings of the Association for Compu- tational Linguistics: ACL 2023, pages 13003–13051, 2023. 13

2023
[60]

Commonsenseqa: A question answering challenge targeting commonsense knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019

2019
[61]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review arXiv 2024
[62]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review arXiv 2026
[63]

Terminal-bench: A benchmark for ai agents in terminal environments, 2025

TTB Team. Terminal-bench: A benchmark for ai agents in terminal environments, 2025

2025
[64]

Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

2022
[65]

Icl-router: In-context learned model representations for llm routing

Chenxu Wang, Hao Li, Yiqun Zhang, Linyao Chen, Jianhao Chen, Ping Jian, Qiaosheng Zhang, and Shuyue Hu. Icl-router: In-context learned model representations for llm routing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33413–33421, 2026

2026
[66]

arXiv preprint arXiv:2602.19672 , year=

Jiayu Wang, Yifei Ming, Zixuan Ke, Shafiq Joty, Aws Albarghouthi, and Frederic Sala. Skil- lorchestra: Learning to route agents via skill transfer.arXiv preprint arXiv:2602.19672, 2026

work page arXiv 2026
[67]

Crowdsourcing multiple choice science questions

Johannes Welbl, Nelson F Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. InProceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, 2017

2017
[68]

Atlas: Orchestrating heterogeneous models and tools for multi-domain complex reasoning.arXiv preprint arXiv:2601.03872, 2026

Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen, Shuai Zhang, Zhengqi Wen, and Jianhua Tao. Atlas: Orchestrating heterogeneous models and tools for multi-domain complex reasoning.arXiv preprint arXiv:2601.03872, 2026

work page arXiv 2026
[69]

WideSeek-R1: Exploring width scaling for broad information seeking via multi-agent reinforcement learning.arXiv preprint arXiv:2602.04634, 2026

Zelai Xu, Zhexuan Xu, Ruize Zhang, Chunyang Zhu, Shi Yu, Weilin Liu, Quanlu Zhang, Wenbo Ding, Chao Yu, and Yu Wang. Wideseek-r1: Exploring width scaling for broad information seeking via multi-agent reinforcement learning.arXiv preprint arXiv:2602.04634, 2026

work page arXiv 2026
[70]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2369–2380, 2018

2018
[71]

Survey on Evaluation of LLM-based Agents

Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, and Michal Shmueli-Scheuer. Survey on evaluation of llm-based agents.arXiv preprint arXiv:2503.16416, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[72]

How do decoder-only llms perceive users? rethinking attention masking for user representation learning.arXiv preprint arXiv:2602.10622, 2026

Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Yang Chen, Xiaotong Lin, Wuliang Huang, Ziyi Gao, Xing Fu, Yu Cheng, et al. How do decoder-only llms perceive users? rethinking attention masking for user representation learning.arXiv preprint arXiv:2602.10622, 2026

work page arXiv 2026
[73]

Masrouter: Learning to route llms for multi-agent systems

Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, and Yiyan Qi. Masrouter: Learning to route llms for multi-agent systems. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15549–15572, 2025

2025
[74]

Reinforcing multi-turn reasoning in llm agents via turn-level credit assignment

Siliang Zeng, Quan Wei, William Brown, Oana Frunza, Yuriy Nevmyvaka, Yang Katie Zhao, and Mingyi Hong. Reinforcing multi-turn reasoning in llm agents via turn-level credit assignment. InICML 2025 Workshop on Computer Use Agents, 2025

2025
[75]

Evoroute: Experience-driven self-routing llm agent systems.arXiv preprint arXiv:2601.02695, 2026

Guibin Zhang, Haiyang Yu, Kaiming Yang, Bingli Wu, Fei Huang, Yongbin Li, and Shuicheng Yan. Evoroute: Experience-driven self-routing llm agent systems.arXiv preprint arXiv:2601.02695, 2026. 14

work page arXiv 2026
[76]

Router-r1: Teaching llms multi-round routing and aggregation via reinforcement learning

Haozhen Zhang, Tao Feng, and Jiaxuan You. Router-r1: Teaching llms multi-round routing and aggregation via reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[77]

Agentorchestra: A hierarchical multi-agent framework for general-purpose task solving.arXiv e-prints, pages arXiv–2506, 2025

Wentao Zhang, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui Zhou, and Bo An. Agentorchestra: A hierarchical multi-agent framework for general-purpose task solving.arXiv e-prints, pages arXiv–2506, 2025

2025
[78]

same task and same turn

Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Gptswarm: Language agents as optimizable graphs. InForty-first International Conference on Machine Learning, 2024. A Experimental Details A.1 Evaluation reporting conventions The main suite comprises the thirteen public benchmarks enumerated in §5. LLMR...

2024
[79]

If not, emit <final_answer> directly

Decide whether the task warrants decomposition. If not, emit <final_answer> directly
[80]

Otherwise, write a <plan> that exposes parallel substructure where it exists; encode strict ordering only viadepends_on

Showing first 80 references.