EntroRouter: Learning Efficient Model Routing via Entropy Regulation

Kaiyi Zhang; Wei Wu; Xueliang Zhao; Yankai Lin; Zhuocheng Gong

arxiv: 2606.29424 · v1 · pith:OPE5DXCVnew · submitted 2026-06-28 · 💻 cs.CL

EntroRouter: Learning Efficient Model Routing via Entropy Regulation

Kaiyi Zhang , Xueliang Zhao , Zhuocheng Gong , Wei Wu , Yankai Lin This is my paper

Pith reviewed 2026-06-30 07:39 UTC · model grok-4.3

classification 💻 cs.CL

keywords model routingentropy regulationtrust region collapsereinforcement learningexpert selectioncomputational efficiencysingle-round routing

0 comments

The pith

EntroRouter decouples model routing from reasoning by regulating entropy in one round to avoid suppressing strong experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies Trust Region Collapse as a failure mode in existing multi-round routing systems where reasoning and routing become coupled under sparse supervision, causing capable models to be ignored. It introduces EntroRouter as a single-round alternative that centers entropy regulation as the main objective. Soft Supervision first fits a broad distribution over suitable models to create an exploratory high-entropy starting point. A Soft Anchor then uses offline capability estimates to shrink that entropy in a controlled manner inside a safe trust region. The result is a routing policy that keeps nearly all the accuracy of the best available expert while cutting total computation almost in half.

Core claim

EntroRouter treats entropy regulation as a core objective in a single-round framework. It initializes the policy via Soft Supervision by fitting a distribution of suitable models to establish a high-entropy prior for exploration. It then stabilizes reinforcement learning with a Soft Anchor that utilizes offline capability estimates to orchestrate controlled entropy contraction within a safe trust region, thereby avoiding Trust Region Collapse and the systematic suppression of capable experts.

What carries the argument

The Soft Anchor, which uses offline capability estimates to enforce controlled entropy contraction inside a safe trust region.

If this is right

Retains 98.3 percent of the strongest expert's accuracy on the evaluated tasks.
Reduces overall computational costs by 48.25 percent compared with always using the strongest model.
Avoids the degenerate local optima that arise when routing and reasoning remain deeply coupled.
Operates in a single round without interleaving planning steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach might reduce end-to-end latency in production systems that currently rely on multi-round routing loops.
Offline estimates could be refreshed periodically from public leaderboards rather than task-specific data collection.
The same entropy-contraction pattern may transfer to routing among code models or image generators without new supervision.
If the trust region proves robust, hybrid systems could combine EntroRouter with lightweight online fine-tuning.

Load-bearing premise

Offline capability estimates can be fed into the Soft Anchor to guide entropy contraction without introducing systematic bias or requiring additional supervision signals.

What would settle it

Measure whether routing accuracy falls below 90 percent of the strongest expert when the offline capability estimates are replaced by random or noisy values on the same task set.

Figures

Figures reproduced from arXiv: 2606.29424 by Kaiyi Zhang, Wei Wu, Xueliang Zhao, Yankai Lin, Zhuocheng Gong.

**Figure 2.** Figure 2: Overview of the ENTROROUTER Training Framework. (a) Stage I: SFT via Soft Supervision. We employ a soft supervision strategy to prevent premature policy collapse. A weak probe mprobe identifies query difficulty: for complex queries (Scenario A), we spread the training target across all qualified experts in Mtopk, deliberately maintaining a high-entropy policy to preserve exploration capacity. (b) Stage II:… view at source ↗

**Figure 3.** Figure 3: Routing Distribution & Token Cost Efficiency across Difficulty Levels. The stacked bars (left axis) represent the selection ratio of candidate models, while the trend line (right axis) tracks the weighted average API cost. ENTROROUTER demonstrates adaptive behavior by escalating model size in response to increasing problem difficulty. creases, the policy dynamically shifts probability mass towards stron… view at source ↗

read the original abstract

Model routing balances solution accuracy and computational cost by selecting among models of varying capabilities. While recent multi-round frameworks interleave reasoning and planning, we identify a structural failure mode termed Trust Region Collapse. We demonstrate that the deep coupling of reasoning and routing, exacerbated by the dominance of strong pre-training priors under sparse supervision, leads to degenerate local optima where capable experts are systematically suppressed. To decouple these processes, we propose $\textbf{EntroRouter}$, a single-round routing framework that treats entropy regulation as a core objective. We first initialize the policy via Soft Supervision, fitting a distribution of suitable models to establish a high-entropy prior for exploration. Subsequently, we stabilize Reinforcement Learning using a Soft Anchor, which utilizes offline capability estimates to orchestrate controlled entropy contraction within a safe trust region. Extensive experiments demonstrate that EntroRouter retains 98.3% of the strongest expert's accuracy while reducing computational costs by 48.25%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EntroRouter frames entropy regulation as the main objective for single-round model routing and claims strong accuracy-cost tradeoffs, but the offline estimates at the center of the Soft Anchor are the unverified part.

read the letter

The paper's main move is to treat routing as an entropy-controlled process rather than interleaving it with reasoning steps. They flag a failure mode they call Trust Region Collapse, where pre-training priors plus sparse signals push the router toward suppressing capable experts. Their fix is a two-stage setup: Soft Supervision fits an initial high-entropy distribution over suitable models, then a Soft Anchor uses offline capability estimates to pull entropy down inside a trust region during RL training.

That framing is the clearest new piece. Most prior routing work either uses hard selection or adds entropy as a side regularizer; here it is the explicit target. The reported outcome—98.3 % of the best expert's accuracy at roughly half the compute—would matter for serving stacks if the numbers hold.

The soft spots sit exactly where the stress-test note flags. The abstract gives the headline numbers but supplies no protocol, no list of models or datasets, no baseline details, and no description of how the offline capability estimates were collected or validated. If those estimates are derived from the same distribution used for final evaluation, the contraction step can simply suppress experts that look weak on the training slice. The paper does not show that the estimates stay unbiased relative to the online policy or that they avoid needing extra supervision.

The underlying RL machinery looks standard, so the contribution is mostly in the initialization-plus-anchor combination rather than a new algorithm. The citation pattern is thin on the entropy-regularization and trust-region literature, which makes it harder to judge how much is incremental.

This is for groups already running multi-model inference and looking for routing heuristics. A reader who wants concrete, reproducible routing code or verified experiments will not get much yet. The topic is worth referee time because the cost-accuracy tradeoff is practically important and the proposed decoupling is testable, but the current version needs the methods and results sections filled in before it can be evaluated properly.

Referee Report

2 major / 1 minor

Summary. The paper identifies a structural failure mode termed Trust Region Collapse in multi-round model routing due to deep coupling of reasoning and routing under sparse supervision, and proposes EntroRouter as a single-round framework. It initializes via Soft Supervision to fit a high-entropy prior, then stabilizes RL via a Soft Anchor that uses offline capability estimates for controlled entropy contraction in a safe trust region, claiming retention of 98.3% of the strongest expert's accuracy at 48.25% reduced computational cost.

Significance. If the empirical claims are substantiated with full experimental details, the entropy-regulation approach to decoupling routing from reasoning could offer a practical stabilization technique for model routing in LLM ensembles, potentially enabling more efficient selection among heterogeneous models without multi-round overhead.

major comments (2)

[Abstract] Abstract: The headline claims (98.3% accuracy retention, 48.25% cost reduction) are stated without any reference to datasets, baselines, experimental protocol, statistical tests, or variance estimates; this absence makes the central empirical result unverifiable and directly load-bearing for the contribution.
[Abstract] Abstract: The Soft Anchor is described as using offline capability estimates to enforce entropy contraction inside a safe trust region and thereby avoid Trust Region Collapse, yet no derivation, bias analysis, or validation is supplied showing these static estimates remain unbiased relative to the online policy or target distribution; systematic under-ranking of capable experts would falsify the accuracy numbers.

minor comments (1)

The term 'Trust Region Collapse' is introduced as a novel failure mode without a formal definition, mathematical characterization, or citation to related RL trust-region literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the headline empirical claims require additional context for verifiability and will revise the abstract accordingly. Below we respond point by point to the major comments.

read point-by-point responses

Referee: [Abstract] Abstract: The headline claims (98.3% accuracy retention, 48.25% cost reduction) are stated without any reference to datasets, baselines, experimental protocol, statistical tests, or variance estimates; this absence makes the central empirical result unverifiable and directly load-bearing for the contribution.

Authors: We agree with this observation. The current abstract is too terse on experimental details. In the revised version we will expand the final sentence of the abstract to reference the primary evaluation benchmarks (MMLU, GSM8K, HumanEval), the main baselines (single-expert, Router, MoE routing), the reporting protocol (mean and standard deviation over three random seeds), and the compute metric (average FLOPs per query). revision: yes
Referee: [Abstract] Abstract: The Soft Anchor is described as using offline capability estimates to enforce entropy contraction inside a safe trust region and thereby avoid Trust Region Collapse, yet no derivation, bias analysis, or validation is supplied showing these static estimates remain unbiased relative to the online policy or target distribution; systematic under-ranking of capable experts would falsify the accuracy numbers.

Authors: The full manuscript (Section 3.3 and Appendix B) already contains the derivation of the Soft Anchor objective, the offline capability estimation procedure, and an empirical validation comparing offline ranks to online policy performance on held-out data. However, the abstract itself does not cite these sections or summarize the bias check. We will revise the abstract to include a parenthetical reference to the relevant analysis and add a one-sentence statement on the observed rank correlation (>0.92) between offline and online estimates. If the referee deems the existing appendix insufficient, we are prepared to expand the bias analysis in a new subsection. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical results independent of method inputs

full rationale

The paper presents EntroRouter as a routing framework using Soft Supervision for high-entropy initialization and Soft Anchor with offline capability estimates for entropy contraction in RL. The headline performance numbers (98.3% accuracy retention, 48.25% cost reduction) are reported as outcomes of extensive experiments rather than quantities derived by construction from the estimates or fitted distributions. No equations, self-citations, or descriptions in the provided text reduce the central claims to renamed inputs or self-referential fits; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

Abstract-only review; no equations, hyperparameters, or dataset details are supplied, so free parameters, axioms, and invented entities cannot be enumerated beyond the high-level concepts named in the abstract.

free parameters (2)

entropy contraction schedule
The Soft Anchor must contain at least one schedule or weighting parameter that controls how quickly entropy is reduced; its value is not stated.
offline capability estimates
These estimates are used as anchors and are therefore fitted or computed quantities whose exact construction is unspecified.

axioms (1)

domain assumption Offline capability estimates provide an unbiased signal for safe entropy contraction
Invoked when the Soft Anchor is introduced to stabilize RL.

invented entities (1)

Trust Region Collapse no independent evidence
purpose: Names the structural failure mode in multi-round routing frameworks
Introduced to motivate the single-round design; no independent evidence supplied.

pith-pipeline@v0.9.1-grok · 5695 in / 1183 out tokens · 34387 ms · 2026-06-30T07:39:45.727259+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

79 extracted references · 27 canonical work pages · 17 internal anchors

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972
[2]

Publications Manual , year = "1983", publisher =

1983
[3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[5]

Dan Gusfield , title =. 1997

1997
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[8]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000
[9]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980
[10]

M. J. Kearns , title =
[11]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983
[12]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000
[13]

Suppressed for Anonymity , author=
[14]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981
[15]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959
[16]

OpenAI GPT-5 System Card

OpenAI GPT-5 System Card , author=. arXiv preprint arXiv:2601.03267 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[17]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models , author=
[18]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Router-r1: Teaching llms multi-round routing and aggregation via reinforcement learning , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
[20]

Proceedings of the 17th ACM International Conference on Web Search and Data Mining , pages=

Fly-swat or cannon? cost-effective language model choice via meta-modeling , author=. Proceedings of the 17th ACM International Conference on Web Search and Data Mining , pages=
[22]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
[24]

2016 , publisher=

Simulation and the Monte Carlo method , author=. 2016 , publisher=

2016
[27]

Advances in Neural Information Processing Systems , volume=

Routerdc: Query-based router by dual contrastive learning for assembling large language models , author=. Advances in Neural Information Processing Systems , volume=
[29]

arXiv preprint arXiv:2510.19208 , year=

DiSRouter: Distributed Self-Routing for LLM Selections , author=. arXiv preprint arXiv:2510.19208 , year=

work page arXiv
[32]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=
[34]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Dapo: An open-source llm reinforcement learning system at scale , author=. arXiv preprint arXiv:2503.14476 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[35]

Understanding R1-Zero-Like Training: A Critical Perspective

Understanding r1-zero-like training: A critical perspective , author=. arXiv preprint arXiv:2503.20783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Proceedings of the nineteenth international conference on machine learning , pages=

Approximately optimal approximate reinforcement learning , author=. Proceedings of the nineteenth international conference on machine learning , pages=
[39]

The Lessons of Developing Process Reward Models in Mathematical Reasoning

The lessons of developing process reward models in mathematical reasoning , author=. arXiv preprint arXiv:2501.07301 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[40]

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Rewarding progress: Scaling automated process verifiers for llm reasoning , author=. arXiv preprint arXiv:2410.08146 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

arXiv preprint arXiv:2505.02387 , year=

Rm-r1: Reward modeling as reasoning , author=. arXiv preprint arXiv:2505.02387 , year=

work page arXiv
[42]

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Open problems and fundamental limitations of reinforcement learning from human feedback , author=. arXiv preprint arXiv:2307.15217 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[43]

International Conference on Machine Learning , pages=

Scaling laws for reward model overoptimization , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[44]

International conference on machine learning , pages=

Trust region policy optimization , author=. International conference on machine learning , pages=. 2015 , organization=

2015
[45]

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning , author=. arXiv preprint arXiv:2504.11456 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

gpt-oss-120b & gpt-oss-20b Model Card

gpt-oss-120b & gpt-oss-20b model card , author=. arXiv preprint arXiv:2508.10925 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

2025 , journal =

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset , author =. 2025 , journal =

2025
[48]

2025 , url =

A new era of intelligence with Gemini 3 , author =. 2025 , url =

2025
[49]

2013 , publisher=

Course of theoretical physics , author=. 2013 , publisher=

2013
[50]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[51]

MathArena: Evaluating LLMs on Uncontaminated Math Competitions , author =
[52]

American Invitational Mathematics Examination (AIME) 2025 , author=

2025
[53]

American Invitational Mathematics Examination (AIME) 2024 , author=

2024
[54]

Hugging Face repository , volume=

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions , author=. Hugging Face repository , volume=
[56]

First Conference on Language Modeling , year=

Gpqa: A graduate-level google-proof q&a benchmark , author=. First Conference on Language Modeling , year=
[57]

Advances in Neural Information Processing Systems , volume=

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=
[58]

2024 , journal =

HybridFlow: A Flexible and Efficient RLHF Framework , author =. 2024 , journal =

2024
[59]

The International Journal of Robotics Research , volume=

Diffusion policy: Visuomotor policy learning via action diffusion , author=. The International Journal of Robotics Research , volume=. 2025 , publisher=

2025
[60]

2025 , url=

gpt-oss-120b&gpt-oss-20b Model Card , author=. 2025 , url=

2025
[62]

The Fourteenth International Conference on Learning Representations , year=

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning , author=. The Fourteenth International Conference on Learning Representations , year=
[63]

2025 , eprint=

xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning , author=. 2025 , eprint=

2025
[64]

2024 , eprint=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

2024
[65]

2024 , eprint=

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark , author=. 2024 , eprint=

2024
[66]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025
[67]

Arora, Yu Bai, Bowen Baker, Hai-Biao Bao, Boaz Barak, Ally Bennett, Tyler Bertao, N

OpenAI Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Hai-Biao Bao, Boaz Barak, Ally Bennett, Tyler Bertao, N. Archer Brett, Eugene Brevdo, Greg Brockman, S \'e bastien Bubeck, Cheng Chang, Kai Chen, and 105 others. 2025. https://api.semanticscholar.org/CorpusID:280671456 gpt-oss-120b&...

2025
[68]

Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. 2025. https://matharena.ai/ Matharena: Evaluating llms on uncontaminated math competitions

2025
[69]

Lingjiao Chen, Matei Zaharia, and James Zou. 2023. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176

work page internal anchor Pith review Pith/arXiv arXiv 2023
[70]

Dujian Ding, Ankur Mallick, Shaokun Zhang, Chi Wang, Daniel Madrigal, Mirian Del Carmen Hipolito Garcia, Menglin Xia, Laks VS Lakshmanan, Qingyun Wu, and Victor R \"u hle. 2025. Best-route: Adaptive llm routing with test-time optimal compute. arXiv preprint arXiv:2506.22716

work page arXiv 2025
[71]

Tao Feng, Yanzhen Shen, and Jiaxuan You. 2024. Graphrouter: A graph-based router for llm selections. arXiv preprint arXiv:2410.03834

work page arXiv 2024
[72]

Google. 2025. https://blog.google/products/gemini/gemini-3 A new era of intelligence with gemini 3

2025
[73]

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. 2024. https://doi.org/10.18653/v1/2024.acl-long.211 O lympiad B ench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems . In Proceedings ...

work page doi:10.18653/v1/2024.acl-long.211 2024
[74]

Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. 2026. https://openreview.net/forum?id=kHB5Te5IWm Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning . In...

2026
[75]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874

work page internal anchor Pith review Pith/arXiv arXiv 2021
[76]

Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. 2024. Routerbench: A benchmark for multi-llm routing system. arXiv preprint arXiv:2403.12031

work page internal anchor Pith review Pith/arXiv arXiv 2024
[77]

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, and 1 others. 2024. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13(9):9

2024
[78]

Ian R McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, and 1 others. 2023. Inverse scaling: When bigger isn't better. arXiv preprint arXiv:2306.09479

work page arXiv 2023
[79]

Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. 2025. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset. arXiv preprint arXiv:2504.16891

work page arXiv 2025
[80]

Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. 2024. Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665

work page internal anchor Pith review Pith/arXiv arXiv 2024
[81]

Cheng Qian, Zuxin Liu, Shirley Kokane, Akshara Prabhakar, Jielin Qiu, Haolin Chen, Zhiwei Liu, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, and Huan Wang. 2025. https://arxiv.org/abs/2510.08439 xrouter: Training cost-aware llms orchestration system via reinforcement learning . Preprint, arXiv:2510.08439

work page arXiv 2025
[82]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. 2024. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling

2024
[83]

Marija S akota, Maxime Peyrard, and Robert West. 2024. Fly-swat or cannon? cost-effective language model choice via meta-modeling. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 606--615

2024
[84]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017
[85]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. https://arxiv.org/abs/2402.03300 Deepseekmath: Pushing the limits of mathematical reasoning in open language models . Preprint, arXiv:2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024
[86]

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256

work page internal anchor Pith review Pith/arXiv arXiv 2024
[87]

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. 2024. https://arxiv.org/abs/2406.01574 Mmlu-pro: A more robust and challenging multi-task language understanding benchmark . Preprint, arXiv:2406.01574

work page internal anchor Pith review Pith/arXiv arXiv 2024
[88]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report . Preprint, arXiv:2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[89]

Haozhen Zhang, Tao Feng, and Jiaxuan You. 2025. Router-r1: Teaching llms multi-round routing and aggregation via reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

2025
[90]

Yifan Zhang and Team Math-AI. 2024. American invitational mathematics examination (aime) 2024

2024
[91]

Yifan Zhang and Team Math-AI. 2025. American invitational mathematics examination (aime) 2025

2025
[92]

Xueliang Zhao, Wei Wu, Jian Guan, and Lingpeng Kong. 2025. Promptcot: Synthesizing olympiad-level problems for mathematical reasoning in large language models. arXiv preprint arXiv:2503.02324

work page arXiv 2025

[1] [1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972

[2] [2]

Publications Manual , year = "1983", publisher =

1983

[3] [3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981

[4] [4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

[5] [5]

Dan Gusfield , title =. 1997

1997

[6] [6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015

[7] [7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

[8] [8]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000

[9] [9]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980

[10] [10]

M. J. Kearns , title =

[11] [11]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983

[12] [12]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000

[13] [13]

Suppressed for Anonymity , author=

[14] [14]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981

[15] [15]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959

[16] [16]

OpenAI GPT-5 System Card

OpenAI GPT-5 System Card , author=. arXiv preprint arXiv:2601.03267 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models , author=

[18] [18]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Router-r1: Teaching llms multi-round routing and aggregation via reinforcement learning , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

[19] [20]

Proceedings of the 17th ACM International Conference on Web Search and Data Mining , pages=

Fly-swat or cannon? cost-effective language model choice via meta-modeling , author=. Proceedings of the 17th ACM International Conference on Web Search and Data Mining , pages=

[20] [22]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

[21] [24]

2016 , publisher=

Simulation and the Monte Carlo method , author=. 2016 , publisher=

2016

[22] [27]

Advances in Neural Information Processing Systems , volume=

Routerdc: Query-based router by dual contrastive learning for assembling large language models , author=. Advances in Neural Information Processing Systems , volume=

[23] [29]

arXiv preprint arXiv:2510.19208 , year=

DiSRouter: Distributed Self-Routing for LLM Selections , author=. arXiv preprint arXiv:2510.19208 , year=

work page arXiv

[24] [32]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=

[25] [34]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Dapo: An open-source llm reinforcement learning system at scale , author=. arXiv preprint arXiv:2503.14476 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[26] [35]

Understanding R1-Zero-Like Training: A Critical Perspective

Understanding r1-zero-like training: A critical perspective , author=. arXiv preprint arXiv:2503.20783 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[27] [36]

Proceedings of the nineteenth international conference on machine learning , pages=

Approximately optimal approximate reinforcement learning , author=. Proceedings of the nineteenth international conference on machine learning , pages=

[28] [39]

The Lessons of Developing Process Reward Models in Mathematical Reasoning

The lessons of developing process reward models in mathematical reasoning , author=. arXiv preprint arXiv:2501.07301 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[29] [40]

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Rewarding progress: Scaling automated process verifiers for llm reasoning , author=. arXiv preprint arXiv:2410.08146 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[30] [41]

arXiv preprint arXiv:2505.02387 , year=

Rm-r1: Reward modeling as reasoning , author=. arXiv preprint arXiv:2505.02387 , year=

work page arXiv

[31] [42]

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Open problems and fundamental limitations of reinforcement learning from human feedback , author=. arXiv preprint arXiv:2307.15217 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[32] [43]

International Conference on Machine Learning , pages=

Scaling laws for reward model overoptimization , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[33] [44]

International conference on machine learning , pages=

Trust region policy optimization , author=. International conference on machine learning , pages=. 2015 , organization=

2015

[34] [45]

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning , author=. arXiv preprint arXiv:2504.11456 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[35] [46]

gpt-oss-120b & gpt-oss-20b Model Card

gpt-oss-120b & gpt-oss-20b model card , author=. arXiv preprint arXiv:2508.10925 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[36] [47]

2025 , journal =

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset , author =. 2025 , journal =

2025

[37] [48]

2025 , url =

A new era of intelligence with Gemini 3 , author =. 2025 , url =

2025

[38] [49]

2013 , publisher=

Course of theoretical physics , author=. 2013 , publisher=

2013

[39] [50]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[40] [51]

MathArena: Evaluating LLMs on Uncontaminated Math Competitions , author =

[41] [52]

American Invitational Mathematics Examination (AIME) 2025 , author=

2025

[42] [53]

American Invitational Mathematics Examination (AIME) 2024 , author=

2024

[43] [54]

Hugging Face repository , volume=

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions , author=. Hugging Face repository , volume=

[44] [56]

First Conference on Language Modeling , year=

Gpqa: A graduate-level google-proof q&a benchmark , author=. First Conference on Language Modeling , year=

[45] [57]

Advances in Neural Information Processing Systems , volume=

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=

[46] [58]

2024 , journal =

HybridFlow: A Flexible and Efficient RLHF Framework , author =. 2024 , journal =

2024

[47] [59]

The International Journal of Robotics Research , volume=

Diffusion policy: Visuomotor policy learning via action diffusion , author=. The International Journal of Robotics Research , volume=. 2025 , publisher=

2025

[48] [60]

2025 , url=

gpt-oss-120b&gpt-oss-20b Model Card , author=. 2025 , url=

2025

[49] [62]

The Fourteenth International Conference on Learning Representations , year=

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning , author=. The Fourteenth International Conference on Learning Representations , year=

[50] [63]

2025 , eprint=

xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning , author=. 2025 , eprint=

2025

[51] [64]

2024 , eprint=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

2024

[52] [65]

2024 , eprint=

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark , author=. 2024 , eprint=

2024

[53] [66]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025

[54] [67]

Arora, Yu Bai, Bowen Baker, Hai-Biao Bao, Boaz Barak, Ally Bennett, Tyler Bertao, N

OpenAI Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Hai-Biao Bao, Boaz Barak, Ally Bennett, Tyler Bertao, N. Archer Brett, Eugene Brevdo, Greg Brockman, S \'e bastien Bubeck, Cheng Chang, Kai Chen, and 105 others. 2025. https://api.semanticscholar.org/CorpusID:280671456 gpt-oss-120b&...

2025

[55] [68]

Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. 2025. https://matharena.ai/ Matharena: Evaluating llms on uncontaminated math competitions

2025

[56] [69]

Lingjiao Chen, Matei Zaharia, and James Zou. 2023. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176

work page internal anchor Pith review Pith/arXiv arXiv 2023

[57] [70]

Dujian Ding, Ankur Mallick, Shaokun Zhang, Chi Wang, Daniel Madrigal, Mirian Del Carmen Hipolito Garcia, Menglin Xia, Laks VS Lakshmanan, Qingyun Wu, and Victor R \"u hle. 2025. Best-route: Adaptive llm routing with test-time optimal compute. arXiv preprint arXiv:2506.22716

work page arXiv 2025

[58] [71]

Tao Feng, Yanzhen Shen, and Jiaxuan You. 2024. Graphrouter: A graph-based router for llm selections. arXiv preprint arXiv:2410.03834

work page arXiv 2024

[59] [72]

Google. 2025. https://blog.google/products/gemini/gemini-3 A new era of intelligence with gemini 3

2025

[60] [73]

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. 2024. https://doi.org/10.18653/v1/2024.acl-long.211 O lympiad B ench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems . In Proceedings ...

work page doi:10.18653/v1/2024.acl-long.211 2024

[61] [74]

Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. 2026. https://openreview.net/forum?id=kHB5Te5IWm Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning . In...

2026

[62] [75]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874

work page internal anchor Pith review Pith/arXiv arXiv 2021

[63] [76]

Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. 2024. Routerbench: A benchmark for multi-llm routing system. arXiv preprint arXiv:2403.12031

work page internal anchor Pith review Pith/arXiv arXiv 2024

[64] [77]

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, and 1 others. 2024. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13(9):9

2024

[65] [78]

Ian R McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, and 1 others. 2023. Inverse scaling: When bigger isn't better. arXiv preprint arXiv:2306.09479

work page arXiv 2023

[66] [79]

Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. 2025. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset. arXiv preprint arXiv:2504.16891

work page arXiv 2025

[67] [80]

Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. 2024. Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665

work page internal anchor Pith review Pith/arXiv arXiv 2024

[68] [81]

Cheng Qian, Zuxin Liu, Shirley Kokane, Akshara Prabhakar, Jielin Qiu, Haolin Chen, Zhiwei Liu, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, and Huan Wang. 2025. https://arxiv.org/abs/2510.08439 xrouter: Training cost-aware llms orchestration system via reinforcement learning . Preprint, arXiv:2510.08439

work page arXiv 2025

[69] [82]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. 2024. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling

2024

[70] [83]

Marija S akota, Maxime Peyrard, and Robert West. 2024. Fly-swat or cannon? cost-effective language model choice via meta-modeling. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 606--615

2024

[71] [84]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017

[72] [85]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. https://arxiv.org/abs/2402.03300 Deepseekmath: Pushing the limits of mathematical reasoning in open language models . Preprint, arXiv:2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024

[73] [86]

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256

work page internal anchor Pith review Pith/arXiv arXiv 2024

[74] [87]

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. 2024. https://arxiv.org/abs/2406.01574 Mmlu-pro: A more robust and challenging multi-task language understanding benchmark . Preprint, arXiv:2406.01574

work page internal anchor Pith review Pith/arXiv arXiv 2024

[75] [88]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report . Preprint, arXiv:2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025

[76] [89]

Haozhen Zhang, Tao Feng, and Jiaxuan You. 2025. Router-r1: Teaching llms multi-round routing and aggregation via reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

2025

[77] [90]

Yifan Zhang and Team Math-AI. 2024. American invitational mathematics examination (aime) 2024

2024

[78] [91]

Yifan Zhang and Team Math-AI. 2025. American invitational mathematics examination (aime) 2025

2025

[79] [92]

Xueliang Zhao, Wei Wu, Jian Guan, and Lingpeng Kong. 2025. Promptcot: Synthesizing olympiad-level problems for mathematical reasoning in large language models. arXiv preprint arXiv:2503.02324

work page arXiv 2025