pith. machine review for the scientific record.

arxiv: 2605.10805 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.CL · stat.ML

Recognition: 2 Lean theorem links

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:14 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · stat.ML
keywords LLM-as-a-Judge · distributionally robust optimization · routing · distribution shift · cost efficiency · reasoning models · KL-divergence

The pith

Reasoning in LLM judges improves accuracy on complex tasks like math and coding but adds little value at high cost on simpler ones, motivating selective routing via robust optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Controlled comparisons reveal that explicit reasoning boosts judgment accuracy mainly on structured verification tasks while delivering limited or negative returns elsewhere and always raising compute costs. The paper therefore frames routing between reasoning and non-reasoning modes as a constrained distributionally robust optimization problem that must respect a fixed budget. RACER solves this by centering a KL-divergence uncertainty set around a nominal distribution, yielding an efficient primal-dual algorithm with uniqueness and linear convergence guarantees. A reader should care because LLM-as-a-judge pipelines are now common yet routinely overpay for reasoning that does not generalize when data shifts.

Core claim

The central claim is that routing decisions between reasoning and non-reasoning LLM judges can be cast as a distributionally robust optimization problem whose solution, obtained via an efficient primal-dual method, selects the cheaper judge whenever it suffices while protecting against distribution shift through a KL-divergence ball; the resulting policy is unique and converges linearly, delivering superior accuracy-cost frontiers in experiments that simulate realistic shifts.
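To fix ideas, here is a schematic of that formulation in our own notation (π is the routing policy, r and c the per-instance reward and cost of the chosen judge, B the budget, ρ the KL radius); the paper's exact objective may differ in details such as entropy regularization.

```latex
% Schematic of the constrained DRO routing problem (our notation, not the paper's):
\max_{\pi}\;\min_{Q:\, D_{\mathrm{KL}}(Q \,\|\, P) \le \rho}\;
  \mathbb{E}_{z \sim Q,\; a \sim \pi(\cdot \mid z)}\!\left[ r(z, a) \right]
\quad \text{s.t.} \quad
  \mathbb{E}_{z \sim Q,\; a \sim \pi(\cdot \mid z)}\!\left[ c(z, a) \right] \le B .
```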

What carries the argument

RACER, the routing policy obtained by solving a constrained distributionally robust optimization problem that uses a KL-divergence uncertainty set to hedge against shifts in the judge's input distribution.
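A minimal sketch of that primal-dual structure, assuming per-instance reward and cost estimates for both judges are available; the function name, update rules, and the crude KL projection below are illustrative, not the paper's algorithm.

```python
import numpy as np

def racer_sketch(r, c, budget, rho=0.1, beta=0.5, lr=0.05, steps=500):
    """Illustrative primal-dual routing loop (not the paper's code).

    r, c   : (n, 2) arrays of estimated reward and cost per instance for
             action 0 (non-reasoning judge) and action 1 (reasoning judge).
    budget : average cost the routing policy must respect.
    rho    : KL-ball radius; beta : entropy temperature; lr : dual step size.
    """
    n = r.shape[0]
    lam = 0.0                   # dual variable for the budget constraint
    w = np.full(n, 1.0 / n)     # adversarial instance weights inside the KL ball
    for _ in range(steps):
        # Primal step: the entropy-regularized policy has a softmax closed form
        # over the cost-penalized reward r - lam * c.
        logits = (r - lam * c) / beta
        pi = np.exp(logits - logits.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)
        # Adversary step: exponentially tilt weights toward low-reward instances,
        # then pull back toward the nominal weights if the tilt leaves the KL ball.
        inst_reward = (pi * r).sum(axis=1)
        tilt = np.exp(-inst_reward / beta)
        w_adv = tilt / tilt.sum()
        kl = float(np.sum(w_adv * np.log(w_adv * n + 1e-12)))  # KL to uniform
        w = w_adv if kl <= rho else 0.5 * (w + w_adv)          # crude projection
        # Dual step: raise lam when the reweighted expected cost exceeds budget.
        exp_cost = float(np.sum(w * (pi * c).sum(axis=1)))
        lam = max(0.0, lam + lr * (exp_cost - budget))
    return pi, lam
```

Running this on synthetic (n, 2) reward and cost arrays shows the intended behavior: as the budget tightens, lam grows and the softmax shifts mass to the cheaper judge except where the estimated reward gap is large.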

If this is right

  • Under a fixed compute budget, selective routing yields higher overall judgment accuracy than always using the reasoning judge.
  • The policy remains effective when test inputs differ from training inputs, unlike non-robust baselines.
  • The primal-dual algorithm scales to large numbers of routing decisions with linear convergence.
  • Uniqueness of the optimal policy removes ambiguity in how to allocate the budget across instances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same robust-routing idea could be applied to other binary capability choices in LLMs, such as whether to invoke chain-of-thought on a per-query basis.
  • If the KL-ball approximation holds, similar uncertainty sets might protect multi-model orchestration systems that switch among several judge back-ends.
  • Future work could test whether training a single model to predict its own routing decision inside the robust objective removes the need for a separate router.

Load-bearing premise

The performance gap between reasoning and non-reasoning judges under real distribution shift can be adequately captured by a KL-divergence ball around a nominal distribution so that the robust policy generalizes.
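One reason this premise is load-bearing: the standard KL-DRO duality (a textbook identity, not quoted from the paper) collapses the inner worst case to a one-dimensional problem, which is what makes an efficient primal-dual scheme plausible, but it also means the hedge only covers shifts expressible as likelihood-ratio tilts of the nominal distribution.

```latex
% Standard KL-DRO duality (textbook identity, stated in our notation):
\sup_{Q:\, D_{\mathrm{KL}}(Q \,\|\, P) \le \rho} \mathbb{E}_{Q}\!\left[ \ell(z) \right]
\;=\;
\inf_{\lambda > 0} \left\{ \lambda \rho
  + \lambda \log \mathbb{E}_{P}\!\left[ e^{\ell(z)/\lambda} \right] \right\}.
```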

What would settle it

On a held-out test set exhibiting genuine distribution shift, a simple non-robust router or fixed reasoning threshold achieves a strictly better accuracy-cost curve than the RACER policy learned from the same training data.

Figures

Figures reproduced from arXiv: 2605.10805 by Hengrui Cai, Lijinghua Zhang, Liner Xiang, Wenbo Zhang.

Figure 1. (a) Reasoning models outperform non-reasoning models on difficult tasks, achieving higher accuracy at the cost of increased computation, while offering only marginal gains on simple tasks. (b) RACER remains robust to out-of-distribution (OOD) inputs by operating over a KL-divergence uncertainty set, whereas standard routing policies fail under distribution shift.

Figure 2. Accuracy–cost trade-offs and reasoning–instructional agreement across benchmarks. Upper: accuracy improvement versus cost ratio. Lower: agreement patterns between instruct and reasoning inference.

Figure 3. Reward–cost trade-offs evaluated on the OOD dataset with a budget of 2. (a) Reward-robust RACER-R improves performance when the cost constraint is easily satisfied. (b) Cost-robust RACER-C matters when the cost constraint can be violated under distribution shift.

Figure 4. Routing performance across compute budgets on ID and OOD benchmarks (top to bottom: Qwen3-1.7B, 4B, and 8B).
Original abstract

Reasoning-capable large language models (LLMs) have recently been adopted as automated judges, but their benefits and costs in LLM-as-a-Judge settings remain unclear. Through controlled comparisons between reasoning and non-reasoning judges, we show that explicit reasoning substantially improves judgment accuracy on tasks requiring structured verification (e.g., math and coding), while offering limited or even negative gains on simpler evaluations and incurring significantly higher computational cost. These findings motivate that reasoning should be used selectively rather than universally, with awareness of possible distribution shift. We propose Robust Adaptive Cost-Efficient Routing (RACER), which dynamically selects between reasoning and non-reasoning judges under a fixed budget by formulating routing as a constrained distributionally robust optimization problem. RACER explicitly accounts for distribution shift via a KL-divergence uncertainty set, admits an efficient primal-dual algorithm, and enjoys theoretical guarantees including uniqueness of the optimal policy and linear convergence. Extensive experiments show that RACER achieves superior accuracy-cost trade-offs under distribution shift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes trade-offs between reasoning and non-reasoning LLMs used as judges, finding that explicit reasoning boosts accuracy on structured verification tasks (math, coding) but adds cost with limited or negative gains on simpler tasks. It proposes RACER, which routes between the two under a fixed budget by casting the problem as constrained distributionally robust optimization with a KL-divergence uncertainty set around a nominal distribution. The method admits an efficient primal-dual algorithm and carries theoretical guarantees of unique optimal policy and linear convergence. Experiments are reported to show improved accuracy-cost trade-offs under distribution shift.

Significance. If the central claims hold, the work supplies a theoretically grounded, algorithmically efficient framework for cost-aware selection of LLM judges that explicitly models distribution shift. The combination of robust optimization, primal-dual solvability, and uniqueness guarantees is a clear strength that could inform practical LLM-as-a-Judge deployments where reasoning budgets are limited.

major comments (2)
  1. The modeling assumption that performance differences between reasoning and non-reasoning judges under real distribution shift lie inside a KL-divergence ball (used to define the uncertainty set in the robust formulation) is load-bearing for both the theoretical guarantees and the empirical superiority claim. Structured shifts that alter the relative value of reasoning across task types (e.g., changing the fraction of verification-heavy instances) may not be captured by KL divergence; the manuscript should provide a concrete sensitivity analysis or counter-example test showing when the primal-dual policy remains optimal outside the training distribution.
  2. The abstract and experimental section claim superior accuracy-cost trade-offs, yet the uncertainty-set radius and budget constraint are free parameters whose selection procedure is not fully detailed. Without explicit reporting of how these are chosen on held-out data and without error bars or statistical tests on the reported trade-off curves, it is impossible to rule out that post-hoc tuning contributes to the observed advantage.
minor comments (2)
  1. Figure captions and axis labels should explicitly state the evaluation metrics (accuracy, cost) and the distribution-shift construction used in each panel.
  2. The notation for the nominal distribution, uncertainty radius, and dual variables should be introduced consistently in the first theoretical section rather than piecemeal.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major point below and will incorporate revisions to improve clarity, rigor, and completeness of the manuscript.

Point-by-point responses
  1. Referee: The modeling assumption that performance differences between reasoning and non-reasoning judges under real distribution shift lie inside a KL-divergence ball (used to define the uncertainty set in the robust formulation) is load-bearing for both the theoretical guarantees and the empirical superiority claim. Structured shifts that alter the relative value of reasoning across task types (e.g., changing the fraction of verification-heavy instances) may not be captured by KL divergence; the manuscript should provide a concrete sensitivity analysis or counter-example test showing when the primal-dual policy remains optimal outside the training distribution.

    Authors: We agree that the KL-divergence uncertainty set is a central modeling choice that enables the tractability, uniqueness, and linear convergence results. KL divergence is a standard and theoretically well-studied choice in distributionally robust optimization because it yields a convex dual problem and admits the efficient primal-dual algorithm presented. To directly address the request, the revised manuscript will include a new sensitivity-analysis subsection reporting results for a range of uncertainty radii, synthetic distribution shifts that explicitly vary the fraction of verification-heavy tasks (math/coding vs. simpler evaluations), and the radius threshold beyond which the optimal policy changes (a sketch of such a test appears after these responses). We will also discuss the limitation that extreme structured shifts lying far outside any reasonable KL ball may require alternative uncertainty sets, while showing that the shifts present in our experimental datasets remain well inside the ball for the radii used. revision: yes

  2. Referee: The abstract and experimental section claim superior accuracy-cost trade-offs, yet the uncertainty-set radius and budget constraint are free parameters whose selection procedure is not fully detailed. Without explicit reporting of how these are chosen on held-out data and without error bars or statistical tests on the reported trade-off curves, it is impossible to rule out that post-hoc tuning contributes to the observed advantage.

    Authors: We appreciate this observation on experimental rigor. The radius and budget parameters were chosen via grid search on a held-out validation split to maximize accuracy subject to the cost constraint; however, the manuscript does not report this procedure or variability measures in sufficient detail. In the revision we will add an explicit subsection describing the hyperparameter selection protocol (including the validation split size and search ranges), report the exact values used for each experiment, include error bars computed from five independent runs with different random seeds, and add statistical significance tests (paired t-tests and Wilcoxon signed-rank tests) comparing RACER against baselines on the accuracy-cost curves. These additions will make the superiority claims more robust and transparent. revision: yes
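A minimal sketch of the structured-shift test promised in response 1, assuming instances can be tagged as verification-heavy; everything below (names, tagging, reweighting scheme) is hypothetical, not the paper's experiment.

```python
import numpy as np

def shift_sensitivity(policy_acc, is_verif, fractions=np.linspace(0.1, 0.9, 9)):
    """Re-evaluate a fixed routing policy as the verification-heavy fraction
    of the test mix varies (hypothetical harness).

    policy_acc : (n,) per-instance accuracy of the learned policy.
    is_verif   : (n,) boolean mask for verification-heavy instances.
    """
    results = {}
    for f in fractions:
        # Reweight instances so verification-heavy tasks carry total mass f.
        w = np.where(is_verif, f / is_verif.sum(), (1.0 - f) / (~is_verif).sum())
        results[round(float(f), 2)] = float(np.sum(w * policy_acc))
    return results
```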
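And a minimal sketch of the selection protocol described in response 2 (grid search on a held-out validation split, subject to the cost constraint); the callables and names are hypothetical placeholders.

```python
import numpy as np

def select_radius(train, val, radii, budget, fit, evaluate):
    """Choose the KL radius by validation accuracy subject to the budget
    (hypothetical protocol mirroring the rebuttal's description).

    fit(train, rho, budget) -> policy
    evaluate(policy, val)   -> (accuracy, mean_cost)
    """
    best_rho, best_acc = None, -np.inf
    for rho in radii:
        policy = fit(train, rho, budget)
        acc, cost = evaluate(policy, val)
        if cost <= budget and acc > best_acc:   # feasible and better
            best_rho, best_acc = rho, acc
    return best_rho
```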

Circularity Check

0 steps flagged

No circularity: derivation is self-contained optimization

full rationale

The paper formulates routing as a constrained DRO problem with KL uncertainty set, then derives a primal-dual algorithm and proves uniqueness/linear convergence from the convex structure of the problem. These steps follow standard DRO duality and do not reduce to fitted parameters or self-citations by construction. Empirical claims rest on separate experiments rather than tautological re-use of inputs. No load-bearing step matches any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard convex optimization assumptions and the modeling choice that KL divergence adequately represents judge-performance shift; no new physical entities are postulated.

free parameters (2)
  • uncertainty set radius
    Controls the size of the KL ball around the nominal distribution and must be chosen or tuned for the robust policy.
  • budget constraint
    Fixed computational budget that the routing policy must respect.
axioms (2)
  • standard math: The primal-dual algorithm converges linearly under the stated convexity and constraint qualification conditions.
    Invoked to guarantee efficient solvability of the routing problem.
  • domain assumption: Judge performance differences under distribution shift lie inside a KL ball of finite radius.
    Core modeling assumption that enables the robust formulation.

pith-pipeline@v0.9.0 · 5486 in / 1399 out tokens · 47167 ms · 2026-05-12T04:14:52.568840+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
