pith. machine review for the scientific record.

arxiv: 2605.10805 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.CL · stat.ML

Recognition: 2 Lean theorem links

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:14 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · stat.ML
keywords LLM-as-a-Judge · distributionally robust optimization · routing · distribution shift · cost efficiency · reasoning models · KL-divergence

The pith

Reasoning in LLM judges improves accuracy on complex tasks like math and coding but adds little value at high cost on simpler ones, motivating selective routing via robust optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Controlled comparisons reveal that explicit reasoning boosts judgment accuracy mainly on structured verification tasks while delivering limited or negative returns elsewhere and always raising compute costs. The paper therefore frames routing between reasoning and non-reasoning modes as a constrained distributionally robust optimization problem that must respect a fixed budget. RACER solves this by centering a KL-divergence uncertainty set around a nominal distribution, yielding an efficient primal-dual algorithm with uniqueness and linear convergence guarantees. A reader should care because LLM-as-a-judge pipelines are now common yet routinely overpay for reasoning that does not generalize when data shifts.

Core claim

The central claim is that routing decisions between reasoning and non-reasoning LLM judges can be cast as a distributionally robust optimization problem whose solution, obtained via an efficient primal-dual method, selects the cheaper judge whenever it suffices while protecting against distribution shift through a KL-divergence ball; the resulting policy is unique and converges linearly, delivering superior accuracy-cost frontiers in experiments that simulate realistic shifts.
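To fix ideas, here is a schematic of that formulation in our own notation (π is the routing policy, r and c the per-instance reward and cost of the chosen judge, B the budget, ρ the KL radius); the paper's exact objective may differ in details such as entropy regularization.

```latex
% Schematic of the constrained DRO routing problem (our notation, not the paper's):
\max_{\pi}\;\min_{Q:\, D_{\mathrm{KL}}(Q \,\|\, P) \le \rho}\;
  \mathbb{E}_{z \sim Q,\; a \sim \pi(\cdot \mid z)}\!\left[ r(z, a) \right]
\quad \text{s.t.} \quad
  \mathbb{E}_{z \sim Q,\; a \sim \pi(\cdot \mid z)}\!\left[ c(z, a) \right] \le B .
```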

What carries the argument

RACER, the routing policy obtained by solving a constrained distributionally robust optimization problem that uses a KL-divergence uncertainty set to hedge against shifts in the judge's input distribution.
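A minimal sketch of that primal-dual structure, assuming per-instance reward and cost estimates for both judges are available; the function name, update rules, and the crude KL projection below are illustrative, not the paper's algorithm.

```python
import numpy as np

def racer_sketch(r, c, budget, rho=0.1, beta=0.5, lr=0.05, steps=500):
    """Illustrative primal-dual routing loop (not the paper's code).

    r, c   : (n, 2) arrays of estimated reward and cost per instance for
             action 0 (non-reasoning judge) and action 1 (reasoning judge).
    budget : average cost the routing policy must respect.
    rho    : KL-ball radius; beta : entropy temperature; lr : dual step size.
    """
    n = r.shape[0]
    lam = 0.0                   # dual variable for the budget constraint
    w = np.full(n, 1.0 / n)     # adversarial instance weights inside the KL ball
    for _ in range(steps):
        # Primal step: the entropy-regularized policy has a softmax closed form
        # over the cost-penalized reward r - lam * c.
        logits = (r - lam * c) / beta
        pi = np.exp(logits - logits.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)
        # Adversary step: exponentially tilt weights toward low-reward instances,
        # then pull back toward the nominal weights if the tilt leaves the KL ball.
        inst_reward = (pi * r).sum(axis=1)
        tilt = np.exp(-inst_reward / beta)
        w_adv = tilt / tilt.sum()
        kl = float(np.sum(w_adv * np.log(w_adv * n + 1e-12)))  # KL to uniform
        w = w_adv if kl <= rho else 0.5 * (w + w_adv)          # crude projection
        # Dual step: raise lam when the reweighted expected cost exceeds budget.
        exp_cost = float(np.sum(w * (pi * c).sum(axis=1)))
        lam = max(0.0, lam + lr * (exp_cost - budget))
    return pi, lam
```

Running this on synthetic (n, 2) reward and cost arrays shows the intended behavior: as the budget tightens, lam grows and the softmax shifts mass to the cheaper judge except where the estimated reward gap is large.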

If this is right

  • Under a fixed compute budget, selective routing yields higher overall judgment accuracy than always using the reasoning judge.
  • The policy remains effective when test inputs differ from training inputs, unlike non-robust baselines.
  • The primal-dual algorithm scales to large numbers of routing decisions with linear convergence.
  • Uniqueness of the optimal policy removes ambiguity in how to allocate the budget across instances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same robust-routing idea could be applied to other binary capability choices in LLMs, such as whether to invoke chain-of-thought on a per-query basis.
  • If the KL-ball approximation holds, similar uncertainty sets might protect multi-model orchestration systems that switch among several judge back-ends.
  • Future work could test whether training a single model to predict its own routing decision inside the robust objective removes the need for a separate router.

Load-bearing premise

The performance gap between reasoning and non-reasoning judges under real distribution shift can be adequately captured by a KL-divergence ball around a nominal distribution so that the robust policy generalizes.
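One reason this premise is load-bearing: the standard KL-DRO duality (a textbook identity, not quoted from the paper) collapses the inner worst case to a one-dimensional problem, which is what makes an efficient primal-dual scheme plausible, but it also means the hedge only covers shifts expressible as likelihood-ratio tilts of the nominal distribution.

```latex
% Standard KL-DRO duality (textbook identity, stated in our notation):
\sup_{Q:\, D_{\mathrm{KL}}(Q \,\|\, P) \le \rho} \mathbb{E}_{Q}\!\left[ \ell(z) \right]
\;=\;
\inf_{\lambda > 0} \left\{ \lambda \rho
  + \lambda \log \mathbb{E}_{P}\!\left[ e^{\ell(z)/\lambda} \right] \right\}.
```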

What would settle it

On a held-out test set exhibiting genuine distribution shift, a simple non-robust router or fixed reasoning threshold achieves a strictly better accuracy-cost curve than the RACER policy learned from the same training data.

Figures

Figures reproduced from arXiv: 2605.10805 by Hengrui Cai, Lijinghua Zhang, Liner Xiang, Wenbo Zhang.

Figure 1. (a) Reasoning models outperform non-reasoning models on difficult tasks, achieving higher accuracy at the cost of increased computation, while offering only marginal gains on simple tasks. (b) RACER remains robust to out-of-distribution (OOD) inputs by operating over a KL-divergence uncertainty set, whereas standard routing policies fail under distribution shift.

Figure 2. Accuracy–cost trade-offs and reasoning–instructional agreement across benchmarks. Upper: accuracy improvement versus cost ratio. Lower: agreement patterns between instruct and reasoning inference.

Figure 3. Reward–cost trade-offs evaluated on the OOD dataset with a budget of 2. (a) Reward-robust RACER-R improves performance when the cost constraint is easily satisfied. (b) Cost-robust RACER-C matters when the cost constraint can be violated under distribution shift.

Figure 4. Routing performance across compute budgets on ID and OOD benchmarks (top to bottom: Qwen3-1.7B, 4B, and 8B).
Original abstract

Reasoning-capable large language models (LLMs) have recently been adopted as automated judges, but their benefits and costs in LLM-as-a-Judge settings remain unclear. Through controlled comparisons between reasoning and non-reasoning judges, we show that explicit reasoning substantially improves judgment accuracy on tasks requiring structured verification (e.g., math and coding), while offering limited or even negative gains on simpler evaluations and incurring significantly higher computational cost. These findings motivate that reasoning should be used selectively rather than universally, with awareness of possible distribution shift. We propose Robust Adaptive Cost-Efficient Routing (RACER), which dynamically selects between reasoning and non-reasoning judges under a fixed budget by formulating routing as a constrained distributionally robust optimization problem. RACER explicitly accounts for distribution shift via a KL-divergence uncertainty set, admits an efficient primal-dual algorithm, and enjoys theoretical guarantees including uniqueness of the optimal policy and linear convergence. Extensive experiments show that RACER achieves superior accuracy-cost trade-offs under distribution shift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes trade-offs between reasoning and non-reasoning LLMs used as judges, finding that explicit reasoning boosts accuracy on structured verification tasks (math, coding) but adds cost with limited or negative gains on simpler tasks. It proposes RACER, which routes between the two under a fixed budget by casting the problem as constrained distributionally robust optimization with a KL-divergence uncertainty set around a nominal distribution. The method admits an efficient primal-dual algorithm and carries theoretical guarantees of unique optimal policy and linear convergence. Experiments are reported to show improved accuracy-cost trade-offs under distribution shift.

Significance. If the central claims hold, the work supplies a theoretically grounded, algorithmically efficient framework for cost-aware selection of LLM judges that explicitly models distribution shift. The combination of robust optimization, primal-dual solvability, and uniqueness guarantees is a clear strength that could inform practical LLM-as-a-Judge deployments where reasoning budgets are limited.

major comments (2)
  1. The modeling assumption that performance differences between reasoning and non-reasoning judges under real distribution shift lie inside a KL-divergence ball (used to define the uncertainty set in the robust formulation) is load-bearing for both the theoretical guarantees and the empirical superiority claim. Structured shifts that alter the relative value of reasoning across task types (e.g., changing the fraction of verification-heavy instances) may not be captured by KL divergence; the manuscript should provide a concrete sensitivity analysis or counter-example test showing when the primal-dual policy remains optimal outside the training distribution.
  2. The abstract and experimental section claim superior accuracy-cost trade-offs, yet the uncertainty-set radius and budget constraint are free parameters whose selection procedure is not fully detailed. Without explicit reporting of how these are chosen on held-out data and without error bars or statistical tests on the reported trade-off curves, it is impossible to rule out that post-hoc tuning contributes to the observed advantage.
minor comments (2)
  1. Figure captions and axis labels should explicitly state the evaluation metrics (accuracy, cost) and the distribution-shift construction used in each panel.
  2. The notation for the nominal distribution, uncertainty radius, and dual variables should be introduced consistently in the first theoretical section rather than piecemeal.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major point below and will incorporate revisions to improve clarity, rigor, and completeness of the manuscript.

Point-by-point responses
  1. Referee: The modeling assumption that performance differences between reasoning and non-reasoning judges under real distribution shift lie inside a KL-divergence ball (used to define the uncertainty set in the robust formulation) is load-bearing for both the theoretical guarantees and the empirical superiority claim. Structured shifts that alter the relative value of reasoning across task types (e.g., changing the fraction of verification-heavy instances) may not be captured by KL divergence; the manuscript should provide a concrete sensitivity analysis or counter-example test showing when the primal-dual policy remains optimal outside the training distribution.

    Authors: We agree that the KL-divergence uncertainty set is a central modeling choice that enables the tractability, uniqueness, and linear convergence results. KL divergence is a standard and theoretically well-studied choice in distributionally robust optimization because it yields a convex dual problem and admits the efficient primal-dual algorithm presented. To directly address the request, the revised manuscript will include a new sensitivity-analysis subsection reporting results for a range of uncertainty radii, synthetic distribution shifts that explicitly vary the fraction of verification-heavy tasks (math/coding vs. simpler evaluations), and the radius threshold beyond which the optimal policy changes (a sketch of such a test appears after these responses). We will also discuss the limitation that extreme structured shifts lying far outside any reasonable KL ball may require alternative uncertainty sets, while showing that the shifts present in our experimental datasets remain well inside the ball for the radii used. revision: yes

  2. Referee: The abstract and experimental section claim superior accuracy-cost trade-offs, yet the uncertainty-set radius and budget constraint are free parameters whose selection procedure is not fully detailed. Without explicit reporting of how these are chosen on held-out data and without error bars or statistical tests on the reported trade-off curves, it is impossible to rule out that post-hoc tuning contributes to the observed advantage.

    Authors: We appreciate this observation on experimental rigor. The radius and budget parameters were chosen via grid search on a held-out validation split to maximize accuracy subject to the cost constraint; however, the manuscript does not report this procedure or variability measures in sufficient detail. In the revision we will add an explicit subsection describing the hyperparameter selection protocol (including the validation split size and search ranges), report the exact values used for each experiment, include error bars computed from five independent runs with different random seeds, and add statistical significance tests (paired t-tests and Wilcoxon signed-rank tests) comparing RACER against baselines on the accuracy-cost curves. These additions will make the superiority claims more robust and transparent. revision: yes
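A minimal sketch of the structured-shift test promised in response 1, assuming instances can be tagged as verification-heavy; everything below (names, tagging, reweighting scheme) is hypothetical, not the paper's experiment.

```python
import numpy as np

def shift_sensitivity(policy_acc, is_verif, fractions=np.linspace(0.1, 0.9, 9)):
    """Re-evaluate a fixed routing policy as the verification-heavy fraction
    of the test mix varies (hypothetical harness).

    policy_acc : (n,) per-instance accuracy of the learned policy.
    is_verif   : (n,) boolean mask for verification-heavy instances.
    """
    results = {}
    for f in fractions:
        # Reweight instances so verification-heavy tasks carry total mass f.
        w = np.where(is_verif, f / is_verif.sum(), (1.0 - f) / (~is_verif).sum())
        results[round(float(f), 2)] = float(np.sum(w * policy_acc))
    return results
```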
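And a minimal sketch of the selection protocol described in response 2 (grid search on a held-out validation split, subject to the cost constraint); the callables and names are hypothetical placeholders.

```python
import numpy as np

def select_radius(train, val, radii, budget, fit, evaluate):
    """Choose the KL radius by validation accuracy subject to the budget
    (hypothetical protocol mirroring the rebuttal's description).

    fit(train, rho, budget) -> policy
    evaluate(policy, val)   -> (accuracy, mean_cost)
    """
    best_rho, best_acc = None, -np.inf
    for rho in radii:
        policy = fit(train, rho, budget)
        acc, cost = evaluate(policy, val)
        if cost <= budget and acc > best_acc:   # feasible and better
            best_rho, best_acc = rho, acc
    return best_rho
```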

Circularity Check

0 steps flagged

No circularity: derivation is self-contained optimization

full rationale

The paper formulates routing as a constrained DRO problem with KL uncertainty set, then derives a primal-dual algorithm and proves uniqueness/linear convergence from the convex structure of the problem. These steps follow standard DRO duality and do not reduce to fitted parameters or self-citations by construction. Empirical claims rest on separate experiments rather than tautological re-use of inputs. No load-bearing step matches any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard convex optimization assumptions and the modeling choice that KL divergence adequately represents judge-performance shift; no new physical entities are postulated.

free parameters (2)
  • uncertainty set radius
    Controls the size of the KL ball around the nominal distribution and must be chosen or tuned for the robust policy.
  • budget constraint
    Fixed computational budget that the routing policy must respect.
axioms (2)
  • standard math: The primal-dual algorithm converges linearly under the stated convexity and constraint qualification conditions.
    Invoked to guarantee efficient solvability of the routing problem.
  • domain assumption: Judge performance differences under distribution shift lie inside a KL ball of finite radius.
    Core modeling assumption that enables the robust formulation.

pith-pipeline@v0.9.0 · 5486 in / 1399 out tokens · 47167 ms · 2026-05-12T04:14:52.568840+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
