pith. machine review for the scientific record.

arxiv: 2605.12001 · v1 · submitted 2026-05-12 · 💻 cs.IT · cs.AI · math.IT

Recognition: 2 theorem links

· Lean Theorem

CR²: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference

Jiangchao Yao, Meixia Tao, Nan Xue, Shengkang Chen, Yaping Sun, Zhiyong Chen, Zixia Hu

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 04:51 UTC · model grok-4.3

classification 💻 cs.IT · cs.AI · math.IT
keywords LLM routing · device-edge inference · conformal risk control · cost-aware routing · wireless edge computing · Pareto frontier · margin gate · risk calibration

The pith

CR² routes LLM queries between device and edge to match full-information accuracy at lower cost using only local signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper formulates wireless device-edge LLM routing as a cost-aware decision problem under resource constraints. It introduces CR², which separates a lightweight device margin gate from an edge utility selector and calibrates decisions with conformal risk control to bound false-acceptance risk. The result is a router that uses only device-side signals yet performs close to a reference with complete information. Readers should care because it addresses real wireless overheads like latency and energy that cloud-focused routers ignore, leading to better accuracy-cost trade-offs in mobile deployments.

Core claim

CR² decouples a lightweight on-device margin gate from an edge-side utility selector for deferred queries. The margin gate operates on frozen query embeddings and a user-specified cost weight to predict whether local execution is utility-optimal relative to the best edge alternative. A conformal risk control calibration procedure maps each operating point to an acceptance threshold, enabling explicit control of the marginal false-acceptance risk under the full-information utility reference. Experiments show that CR² closely matches a full-information reference router using only device-side signals before deferral and reduces normalized deployment cost by up to 16.9% at matched accuracy.

What carries the argument

The margin gate, which predicts local execution optimality relative to edge alternatives based on cost weight and frozen embeddings, combined with CRC calibration for risk-controlled thresholds.
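The two-stage rule this describes can be sketched as a small decision function. The names `gate`, `tau`, and `edge_utilities` are illustrative, since the paper's actual interfaces are not reproduced on this page: the gate maps an embedding and cost weight to a margin score, `tau` is the CRC-calibrated acceptance threshold for the chosen operating point, and the edge selector picks the highest-utility edge model for deferred queries.

```python
def route(embedding, cost_weight, gate, tau, edge_utilities):
    """Two-stage device-edge routing sketch (illustrative interfaces).

    gate(embedding, cost_weight) -> margin score; higher means local
        execution looks utility-optimal versus the best edge option.
    tau: CRC-calibrated acceptance threshold for this operating point.
    edge_utilities(embedding) -> per-edge-model utility scores for
        queries the gate defers.
    """
    margin = gate(embedding, cost_weight)
    if margin >= tau:
        return ("device", None)  # accept: run the on-device model
    # defer: edge-side utility selector picks the best edge model
    scores = edge_utilities(embedding)
    best = max(range(len(scores)), key=scores.__getitem__)
    return ("edge", best)
```

A toy gate that thresholds one embedding coordinate against the cost weight is enough to exercise both branches.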

If this is right

  • CR² consistently improves the deployable accuracy-cost Pareto frontier compared with strong query-level baselines.
  • It reduces normalized deployment cost by up to 16.9% at matched accuracy.
  • It enables explicit control of the marginal false-acceptance risk under the full-information utility reference.
  • It achieves near full-information performance while relying solely on device-side signals before any deferral.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The two-stage separation of device gate and edge selector may simplify updates when edge models change independently of on-device hardware.
  • Similar margin-gate plus conformal calibration patterns could extend to other deferral tasks in edge computing if the utility reference remains stable.
  • Lowering cost at fixed accuracy may increase the number of queries that can run safely on battery-constrained devices without edge fallback.

Load-bearing premise

The full-information utility reference used for CRC calibration accurately represents real deployment conditions and the margin gate generalizes across queries and operating points without significant distribution shift.

What would settle it

Measure the realized false-acceptance rate and actual deployment cost in live wireless tests against the calibrated bounds and check whether the 16.9% cost reduction at matched accuracy still holds.
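The risk half of that check is simple accounting. Under one plausible reading of the paper's marginal false-acceptance risk (an expectation over all queries, not just accepted ones), the realized rate to compare against the calibrated target is:

```python
def false_acceptance_rate(accepted, locally_optimal):
    """Fraction of all queries accepted for local execution that were
    not in fact utility-optimal locally. One plausible reading of the
    paper's 'marginal false-acceptance risk'; the exact loss definition
    is an assumption here.

    accepted: per-query booleans, True if the gate kept the query local.
    locally_optimal: per-query booleans from the full-information
        reference, True if local execution was utility-optimal.
    """
    assert len(accepted) == len(locally_optimal)
    errors = sum(a and not o for a, o in zip(accepted, locally_optimal))
    return errors / len(accepted)
```

In a live test, this rate measured over wireless traffic should stay at or below the CRC target level for the bound to be considered validated.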

Figures

Figures reproduced from arXiv: 2605.12001 by Jiangchao Yao, Meixia Tao, Nan Xue, Shengkang Chen, Yaping Sun, Zhiyong Chen, Zixia Hu.

Figure 1. Two-tier device-edge inference and routing flow.
Figure 2. Overview of CR² including offline training, CRC-based calibration, and online two-stage routing.
Figure 3. Accuracy–cost Pareto curves: (a) full range and (b) zoomed operating
Figure 4. Fixed-accuracy cost comparison.
Figure 6. Marginal false-acceptance rate under CRC-calibrated thresholds.
Figure 7. Local-model selection rate.

Table II. Per-benchmark accuracy at representative cost targets; blocks use the nearest reachable operating point; "–" denotes an unreachable target; Avg. is over all test queries.

  c̄ = 0.35
  Method              MMLU    BBH     GPQA    MBPP    Avg
  MLP                 0.762   0.850   0.624   0.718   0.738
  KNN                 0.806   0.872   0.518   0.821   0.754
  EmbedLLM            0.799   0.857   0.533   0.821   0.752
  LLMRank             0.802   0.868   0.515   0.769   0.738
  CR² (device-edge)   0.…
Figure 9. Gate-error decomposition.
read the original abstract

As large language models (LLMs) move from centralized clouds to mobile edge environments, efficient serving must balance latency, energy consumption, and accuracy under constrained device-edge resources. Query-level routing between lightweight on-device models and stronger edge models provides a flexible mechanism to navigate this trade-off. However, existing routers are designed for centralized cloud settings and optimize token-level costs, failing to capture the dynamic latency and energy overheads in wireless edge deployments. In this paper, we formulate mobile edge LLM routing as a deployment-constrained, cost-aware decision problem, and propose CR^2, a two-stage device-edge routing framework. CR^2 decouples a lightweight on-device margin gate from an edge-side utility selector for deferred queries. The margin gate operates on frozen query embeddings and a user-specified cost weight to predict whether local execution is utility-optimal relative to the best edge alternative under the target operating point. We further introduce a conformal risk control (CRC) calibration procedure that maps each operating point to an acceptance threshold, enabling explicit control of the marginal false-acceptance risk under the full-information utility reference. Experiments on the routing task show that CR^2 closely matches a full-information reference router using only device-side signals before deferral. Compared with strong query-level baselines, CR^2 consistently improves the deployable accuracy-cost Pareto frontier and reduces normalized deployment cost by up to 16.9% at matched accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CR², a two-stage device-edge routing framework for wireless LLM inference. A lightweight on-device margin gate operates on frozen query embeddings and a user-specified cost weight to decide local execution versus deferral; an edge-side utility selector handles deferred queries. Conformal risk control (CRC) is used to calibrate acceptance thresholds against a full-information utility reference, with the goal of controlling marginal false-acceptance risk. Experiments claim that CR² closely matches the full-information reference router while using only device-side signals, improves the deployable accuracy-cost Pareto frontier over strong baselines, and reduces normalized deployment cost by up to 16.9% at matched accuracy.

Significance. If the empirical claims hold under realistic wireless conditions, the work provides a practical mechanism for cost-aware routing with explicit risk guarantees via CRC. The decoupling of the device-side gate from the edge selector and the use of CRC for tunable operating points are strengths that could aid deployment under resource constraints. The paper supplies falsifiable predictions through its Pareto-frontier comparisons and cost-reduction figures.

major comments (2)
  1. [CRC calibration procedure] CRC calibration procedure: the procedure maps device-side margin-gate outputs to thresholds using a full-information utility reference that incorporates both device and edge model outcomes plus exact costs. The manuscript provides no sensitivity analysis or experiments incorporating stochastic wireless channel traces (variable transmission latency or energy draw), which risks violating the exchangeability assumption needed for the marginal coverage guarantee to transfer to real deployments. This directly underpins the central claim that CR² 'closely matches' the reference router.
  2. [Experimental results] Experimental results: the headline 16.9% normalized cost reduction and Pareto-frontier improvements are reported without error bars, confidence intervals, query-split details, or statistical significance tests. It is therefore impossible to determine whether the gains are robust or could arise from post-hoc threshold selection.
minor comments (2)
  1. [Abstract] The abstract refers to 'strong query-level baselines' without naming them; explicit identification would allow readers to assess the strength of the comparison.
  2. [Methods] Notation for the margin gate, utility selector, and CRC threshold mapping would be clearer if introduced with explicit equations at the start of the methods section.
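For context on the referee's exchangeability concern, the generic conformal-risk-control recipe the paper builds on ([14]) is easy to state: pick the smallest acceptance threshold whose inflated empirical risk on an exchangeable calibration set stays below the target level. The sketch below assumes a 0/1 false-acceptance loss (accept with margin ≥ λ but not locally optimal) that is monotone non-increasing in λ and bounded by B = 1; the calibration interface is illustrative, not the paper's exact procedure.

```python
def crc_threshold(calib, alpha, candidates, B=1.0):
    """Conformal-risk-control threshold selection (generic recipe).

    calib: list of (margin, locally_optimal) pairs from a held-out
        calibration set, assumed exchangeable with test queries.
    alpha: target marginal false-acceptance risk level.
    candidates: candidate thresholds; risk is non-increasing in the
        threshold, so we return the smallest one that certifies alpha.
    B: upper bound on the per-query loss (1 for 0/1 loss).
    """
    n = len(calib)
    for lam in sorted(candidates):
        # empirical risk if every query with margin >= lam runs locally
        risk = sum(1 for m, ok in calib if m >= lam and not ok) / n
        # CRC inflation term guarantees E[loss] <= alpha on fresh data
        if (n / (n + 1)) * risk + B / (n + 1) <= alpha:
            return lam
    return max(candidates)  # fall back to the most conservative threshold
```

The guarantee is marginal and rests on exchangeability between calibration and deployment queries, which is exactly what stochastic channel variation could break.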

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and agree to incorporate additional analysis and statistical reporting to strengthen the manuscript.

read point-by-point responses
  1. Referee: [CRC calibration procedure] CRC calibration procedure: the procedure maps device-side margin-gate outputs to thresholds using a full-information utility reference that incorporates both device and edge model outcomes plus exact costs. The manuscript provides no sensitivity analysis or experiments incorporating stochastic wireless channel traces (variable transmission latency or energy draw), which risks violating the exchangeability assumption needed for the marginal coverage guarantee to transfer to real deployments. This directly underpins the central claim that CR² 'closely matches' the reference router.

    Authors: We appreciate the referee's emphasis on the exchangeability assumption underlying conformal risk control. Our calibration uses a full-information utility reference computed from average device and edge costs under the target operating point, which preserves exchangeability with respect to the query distribution in our experimental setup. We acknowledge that real-world stochastic channel variations (e.g., latency jitter) could affect empirical coverage. In the revision we will add a dedicated sensitivity analysis subsection that injects stochastic wireless traces (Rayleigh fading for transmission energy/latency) into the utility reference and reports the resulting marginal coverage rates. This will directly support the robustness of the 'closely matches' claim under more realistic conditions. revision: yes

  2. Referee: [Experimental results] Experimental results: the headline 16.9% normalized cost reduction and Pareto-frontier improvements are reported without error bars, confidence intervals, query-split details, or statistical significance tests. It is therefore impossible to determine whether the gains are robust or could arise from post-hoc threshold selection.

    Authors: We agree that error bars, confidence intervals, and significance testing are necessary to establish robustness. The reported 16.9% figure is the maximum improvement observed across operating points; the original experiments used fixed train/test splits without repeated sampling. In the revised manuscript we will (i) report mean and standard deviation of cost reduction and Pareto metrics over five random query splits, (ii) add error bars to all Pareto-frontier and cost-reduction plots, and (iii) include paired statistical tests (t-test or Wilcoxon signed-rank) against baselines to confirm that improvements are not attributable to post-hoc threshold choice. revision: yes
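The repeated-split reporting promised here can be checked with a lightweight paired test. With only five splits, an exact sign-flip permutation test on per-split cost differences is a reasonable nonparametric stand-in for the proposed t-test or Wilcoxon test; this is an illustrative analysis sketch, not the paper's procedure.

```python
import itertools


def paired_permutation_pvalue(costs_a, costs_b):
    """Exact two-sided sign-flip permutation test on paired per-split
    costs. With few splits (e.g. five) the 2^n enumeration is cheap.
    Illustrative analysis plan, not taken from the paper.
    """
    diffs = [a - b for a, b in zip(costs_a, costs_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # under H0 each paired difference is symmetric about zero,
    # so every sign assignment is equally likely
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total
```

With five splits the smallest attainable two-sided p-value is 2/32 = 0.0625, which is itself a useful caution about what five splits can and cannot establish.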

Circularity Check

0 steps flagged

No significant circularity; derivation uses external reference and standard CRC.

full rationale

The paper trains a device-side margin gate on frozen embeddings to predict utility-optimality labels derived from a full-information reference (device + edge outcomes + costs). It then applies standard conformal risk control (CRC) on a calibration set to set acceptance thresholds that guarantee marginal false-acceptance risk w.r.t. that same reference. The headline claim of 'closely matches' is an empirical comparison on held-out queries, not a definitional identity or a fitted parameter renamed as prediction. CRC coverage is a known property independent of the paper's fitted values; the reference is external to the device signals used at inference. No self-citation load-bearing step, no ansatz smuggled via prior work, and no reduction of the Pareto improvement to the inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework depends on the validity of conformal risk control for marginal coverage and on the existence of a computable full-information utility reference; no new entities are postulated.

free parameters (2)
  • user-specified cost weight
    Controls the operating point of the margin gate; chosen by the user rather than learned from data.
  • CRC acceptance threshold
    Derived from calibration on the full-information reference; maps each operating point to a risk-controlled decision boundary.
axioms (1)
  • standard math Conformal risk control guarantees marginal coverage under exchangeability
    Invoked to ensure the false-acceptance risk stays below the target level.
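The axiom is the standard conformal risk control guarantee [14]: for $n$ exchangeable calibration points and a loss $\ell$ that is monotone non-increasing in the threshold $\lambda$ and bounded by $B$, choosing

```latex
\hat{\lambda} \;=\; \inf\Big\{\lambda :\ \tfrac{n}{n+1}\,\hat{R}_n(\lambda) + \tfrac{B}{n+1} \le \alpha\Big\}
\quad\Longrightarrow\quad
\mathbb{E}\big[\ell(\hat{\lambda})\big] \le \alpha,
```

where $\hat{R}_n$ is the empirical risk on the calibration set. This marginal bound, applied to the false-acceptance loss, is what the ledger invokes.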

pith-pipeline@v0.9.0 · 5578 in / 1358 out tokens · 63489 ms · 2026-05-13T04:51:17.793095+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 4 internal anchors

  1. [1]

    GPT-4 Technical Report

    J. Achiam et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Mobile edge intelligence for large language models: A contemporary survey

    G. Qu, Q. Chen, W. Wei, Z. Lin, X. Chen, and K. Huang, “Mobile edge intelligence for large language models: A contemporary survey,” IEEE Commun. Surveys Tuts., vol. 27, no. 6, pp. 3820–3860, 2025

  3. [3]

    Qwen3 Technical Report

    A. Yang et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

  4. [4]

    Ollama

    Ollama, “Ollama,” GitHub, 2026, accessed: May 11, 2026. [Online]. Available: https://github.com/ollama/ollama

  5. [5]

    EdgeShard: Efficient LLM inference via collaborative edge computing

    M. Zhang, X. Shen, J. Cao, Z. Cui, and S. Jiang, “EdgeShard: Efficient LLM inference via collaborative edge computing,” IEEE Internet Things J., vol. 12, no. 10, pp. 13119–13131, 2024

  6. [6]

    CE-CoLLM: Efficient and adaptive large language models through cloud-edge collaboration

    H. Jin and Y. Wu, “CE-CoLLM: Efficient and adaptive large language models through cloud-edge collaboration,” in Proc. IEEE Int. Conf. Web Services (ICWS), 2025, pp. 316–323

  7. [7]

    Serving long-context LLMs at the mobile edge: Test-time reinforcement learning-based model caching and inference offloading

    M. Xu, D. Niyato, and C. G. Brinton, “Serving long-context LLMs at the mobile edge: Test-time reinforcement learning-based model caching and inference offloading,” IEEE Trans. Netw., 2026

  8. [8]

    Neurosurgeon: Collaborative intelligence between the cloud and mobile edge

    Y. Kang et al., “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” ACM SIGARCH Comput. Archit. News, vol. 45, no. 1, pp. 615–629, 2017

  9. [9]

    SPINN: Synergistic progressive inference of neural networks over device and cloud

    S. Laskaridis, S. I. Venieris, M. Almeida, I. Leontiadis, and N. D. Lane, “SPINN: Synergistic progressive inference of neural networks over device and cloud,” in Proc. ACM MobiCom, 2020, pp. 1–15

  10. [10]

    Edge intelligence: On-demand deep learning model co-inference with device-edge synergy

    E. Li, Z. Zhou, and X. Chen, “Edge intelligence: On-demand deep learning model co-inference with device-edge synergy,” in Proc. ACM SIGCOMM Workshop Mobile Edge Commun., 2018, pp. 31–36

  11. [11]

    Fast inference from transformers via speculative decoding

    Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from transformers via speculative decoding,” in Proc. Int. Conf. Mach. Learn. (ICML), 2023, pp. 19274–19286

  12. [12]

    Sequoia: Scalable, robust, and hardware-aware speculative decoding

    Z. Chen et al., “Sequoia: Scalable, robust, and hardware-aware speculative decoding,” arXiv preprint arXiv:2402.12374, 2024

  13. [13]

    EAGLE: Speculative sampling requires rethinking feature uncertainty

    Y. Li, F. Wei, C. Zhang, and H. Zhang, “EAGLE: Speculative sampling requires rethinking feature uncertainty,” in Proc. Int. Conf. Mach. Learn. (ICML), 2024, pp. 28935–28948

  14. [14]

    Conformal Risk Control

    A. N. Angelopoulos, S. Bates, A. Fisch, L. Lei, and T. Schuster, “Conformal Risk Control,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2024

  15. [15]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” J. Mach. Learn. Res., vol. 23, no. 120, pp. 1–39, 2022

  16. [16]

    Mixture-of-experts with expert choice routing

    Y. Zhou et al., “Mixture-of-experts with expert choice routing,” Adv. Neural Inf. Process. Syst., vol. 35, pp. 7103–7114, 2022

  17. [17]

    R2-T2: Re-routing in test-time for multimodal mixture-of-experts

    Z. Li, Z. Li, and T. Zhou, “R2-T2: Re-routing in test-time for multimodal mixture-of-experts,” in Proc. Int. Conf. Mach. Learn. (ICML), 2025, pp. 35292–35316

  18. [18]

    Routing to the expert: Efficient reward-guided ensemble of large language models

    K. Lu et al., “Routing to the expert: Efficient reward-guided ensemble of large language models,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol. (NAACL-HLT), 2024, pp. 1964–1974

  19. [19]

    TriSpec: Ternary speculative decoding via lightweight proxy verification

    H. Jiang et al., “TriSpec: Ternary speculative decoding via lightweight proxy verification,” arXiv preprint arXiv:2601.23180, 2026

  20. [20]

    RouteLLM: Learning to route LLMs from preference data

    I. Ong et al., “RouteLLM: Learning to route LLMs from preference data,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2025

  21. [21]

    Hybrid LLM: Cost-efficient and quality-aware query routing

    D. Ding et al., “Hybrid LLM: Cost-efficient and quality-aware query routing,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2024

  22. [22]

    GraphRouter: A graph-based router for LLM selections

    T. Feng, Y. Shen, and J. You, “GraphRouter: A graph-based router for LLM selections,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2025

  23. [23]

    TensorOpera router: A multi-model router for efficient LLM inference

    D. Stripelis et al., “TensorOpera router: A multi-model router for efficient LLM inference,” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Industry Track, 2024, pp. 452–462

  24. [24]

    BEST-Route: Adaptive LLM routing with test-time optimal compute

    D. Ding et al., “BEST-Route: Adaptive LLM routing with test-time optimal compute,” in Proc. Int. Conf. Mach. Learn. (ICML), 2025, pp. 13870–13884

  25. [25]

    Dynamic quality-latency aware routing for LLM inference in wireless edge-device networks

    R. Bao, N. Xue, Y. Sun, and Z. Chen, “Dynamic quality-latency aware routing for LLM inference in wireless edge-device networks,” in Proc. IEEE/CIC Int. Conf. Commun. China Workshops (ICCC Workshops), 2025, pp. 1–6

  26. [26]

    Smoothie: Label free language model routing

    N. Guha, M. F. Chen, T. Chow, I. S. Khare, and C. Re, “Smoothie: Label free language model routing,” Adv. Neural Inf. Process. Syst., vol. 37, pp. 127645–127672, 2024

  27. [27]

    Capability instruction tuning

    Y.-K. Zhang, D.-C. Zhan, and H.-J. Ye, “Capability instruction tuning,” in Proc. AAAI Conf. Artif. Intell., vol. 39, no. 24, 2025, pp. 25958–25966

  28. [28]

    FrugalGPT: How to use large language models while reducing cost and improving performance

    L. Chen, M. Zaharia, and J. Zou, “FrugalGPT: How to use large language models while reducing cost and improving performance,” Trans. Mach. Learn. Res., 2024

  29. [29]

    Quality-of-service aware LLM routing for edge computing with multiple experts

    J. Yang, Q. Wu, Z. Feng, Z. Zhou, D. Guo, and X. Chen, “Quality-of-service aware LLM routing for edge computing with multiple experts,” IEEE Trans. Mobile Comput., vol. 24, no. 12, pp. 13648–13662, 2025

  30. [30]

    EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference

    T. Tambe et al., “EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference,” in Proc. IEEE/ACM Int. Symp. Microarchitecture (MICRO), 2021, pp. 830–844

  31. [31]

    SlimCaching: Edge caching of mixture-of-experts for distributed inference

    Q. Chen, X. Chen, and K. Huang, “SlimCaching: Edge caching of mixture-of-experts for distributed inference,” IEEE Trans. Mobile Comput., pp. 1–15, 2026

  32. [32]

    WDMoE: Wireless distributed mixture of experts for large language models

    N. Xue et al., “WDMoE: Wireless distributed mixture of experts for large language models,” IEEE Trans. Wireless Commun., 2025

  33. [33]

    Stable-MoE: Lyapunov-based token routing for distributed mixture-of-experts training over edge networks

    L. Shi, B. Ou, K. Wei, W. Zhu, Z. Wang, and Z. Chen, “Stable-MoE: Lyapunov-based token routing for distributed mixture-of-experts training over edge networks,” IEEE Trans. Veh. Technol., 2026

  34. [34]

    CSGO: Generalized optimization for cold start in wireless collaborative edge LLM systems

    X. Liu et al., “CSGO: Generalized optimization for cold start in wireless collaborative edge LLM systems,” J. Commun. Inf. Netw., vol. 10, no. 4, pp. 340–351, 2025

  35. [35]

    WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference

    Z. Liu et al., “WISV: Wireless-informed semantic verification for distributed speculative decoding in device-edge LLM inference,” arXiv preprint arXiv:2604.17701, 2026

  36. [36]

    MoE 2: Optimizing collaborative inference for edge large language models

    L. Jin et al., “MoE 2: Optimizing collaborative inference for edge large language models,” IEEE Trans. Netw., vol. 34, pp. 4637–4651, 2026

  37. [37]

    Large language model-empowered resource allocation in intent-driven wireless networks

    H. Sun et al., “Large language model-empowered resource allocation in intent-driven wireless networks,” IEEE Trans. Cogn. Commun. Netw., vol. 12, pp. 6265–6280, 2026

  38. [38]

    EmbedLLM: Learning compact representations of large language models

    R. Zhuang, T. Wu, Z. Wen, A. Li, J. Jiao, and K. Ramchandran, “EmbedLLM: Learning compact representations of large language models,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2025

  39. [39]

    RouterDC: Query-based router by dual contrastive learning for assembling large language models

    S. Chen, W. Jiang, B. Lin, J. Kwok, and Y. Zhang, “RouterDC: Query-based router by dual contrastive learning for assembling large language models,” Adv. Neural Inf. Process. Syst., vol. 37, pp. 66305–66328, 2024

  40. [40]

    RadialRouter: Structured representation for efficient and robust large language models routing

    R. Jin et al., “RadialRouter: Structured representation for efficient and robust large language models routing,” in Findings Assoc. Comput. Linguistics: EMNLP, 2025, pp. 14587–14600

  41. [41]

    Algorithmic Learning in a Random World

    V. Vovk, A. Gammerman, and G. Shafer, Algorithmic Learning in a Random World. Springer, 2005

  42. [42]

    Learn then test: Calibrating predictive algorithms to achieve risk control

    A. N. Angelopoulos, S. Bates, E. J. Candès, M. I. Jordan, and L. Lei, “Learn then test: Calibrating predictive algorithms to achieve risk control,” Ann. Appl. Stat., vol. 19, no. 2, pp. 1641–1662, 2025

  43. [43]

    Measuring massive multitask language understanding

    D. Hendrycks et al., “Measuring massive multitask language understanding,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2021

  44. [44]

    Challenging BIG-Bench tasks and whether chain-of-thought can solve them

    M. Suzgun et al., “Challenging BIG-Bench tasks and whether chain-of-thought can solve them,” in Findings Assoc. Comput. Linguistics: ACL, 2023, pp. 13003–13051

  45. [45]

    GPQA: A graduate-level Google-proof Q&A benchmark

    D. Rein et al., “GPQA: A graduate-level Google-proof Q&A benchmark,” in Proc. First Conf. Lang. Model. (COLM), 2024

  46. [46]

    Program Synthesis with Large Language Models

    J. Austin et al., “Program synthesis with large language models,” arXiv preprint arXiv:2108.07732, 2021

  47. [47]

    The Language Model Evaluation Harness

    L. Gao et al., “The Language Model Evaluation Harness,” Jul. 2024. [Online]. Available: https://zenodo.org/records/12608602

  48. [48]

    RouterBench: A benchmark for multi-LLM routing system

    Q. J. Hu et al., “RouterBench: A benchmark for multi-LLM routing system,” arXiv preprint arXiv:2403.12031, 2024

  49. [49]

    LLMRank: Understanding LLM strengths for model routing

    S. Agrawal and P. Gupta, “LLMRank: Understanding LLM strengths for model routing,” arXiv preprint arXiv:2510.01234, 2025