pith. machine review for the scientific record.

arxiv: 2605.12001 · v1 · submitted 2026-05-12 · 💻 cs.IT · cs.AI · math.IT

Recognition: 2 theorem links

· Lean Theorem

CR²: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference

Jiangchao Yao, Meixia Tao, Nan Xue, Shengkang Chen, Yaping Sun, Zhiyong Chen, Zixia Hu

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 04:51 UTC · model grok-4.3

classification 💻 cs.IT · cs.AI · math.IT
keywords LLM routing · device-edge inference · conformal risk control · cost-aware routing · wireless edge computing · Pareto frontier · margin gate · risk calibration

The pith

CR² routes LLM queries between device and edge to match full-information accuracy at lower cost using only local signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper formulates wireless device-edge LLM routing as a cost-aware decision problem under resource constraints. It introduces CR², which separates a lightweight device margin gate from an edge utility selector and calibrates decisions with conformal risk control to bound false-acceptance risk. The result is a router that uses only device-side signals yet performs close to a reference with complete information. Readers should care because it addresses real wireless overheads like latency and energy that cloud-focused routers ignore, leading to better accuracy-cost trade-offs in mobile deployments.

Core claim

CR² decouples a lightweight on-device margin gate from an edge-side utility selector for deferred queries. The margin gate operates on frozen query embeddings and a user-specified cost weight to predict whether local execution is utility-optimal relative to the best edge alternative. A conformal risk control calibration procedure maps each operating point to an acceptance threshold, enabling explicit control of the marginal false-acceptance risk under the full-information utility reference. Experiments show that CR² closely matches a full-information reference router using only device-side signals before deferral and reduces normalized deployment cost by up to 16.9% at matched accuracy.

What carries the argument

The margin gate, which predicts local execution optimality relative to edge alternatives based on cost weight and frozen embeddings, combined with CRC calibration for risk-controlled thresholds.
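The two-stage rule this describes can be sketched as a small decision function. The names `gate`, `tau`, and `edge_utilities` are illustrative, since the paper's actual interfaces are not reproduced on this page: the gate maps an embedding and cost weight to a margin score, `tau` is the CRC-calibrated acceptance threshold for the chosen operating point, and the edge selector picks the highest-utility edge model for deferred queries.

```python
def route(embedding, cost_weight, gate, tau, edge_utilities):
    """Two-stage device-edge routing sketch (illustrative interfaces).

    gate(embedding, cost_weight) -> margin score; higher means local
        execution looks utility-optimal versus the best edge option.
    tau: CRC-calibrated acceptance threshold for this operating point.
    edge_utilities(embedding) -> per-edge-model utility scores for
        queries the gate defers.
    """
    margin = gate(embedding, cost_weight)
    if margin >= tau:
        return ("device", None)  # accept: run the on-device model
    # defer: edge-side utility selector picks the best edge model
    scores = edge_utilities(embedding)
    best = max(range(len(scores)), key=scores.__getitem__)
    return ("edge", best)
```

A toy gate that thresholds one embedding coordinate against the cost weight is enough to exercise both branches.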

If this is right

  • CR² consistently improves the deployable accuracy-cost Pareto frontier compared with strong query-level baselines.
  • It reduces normalized deployment cost by up to 16.9% at matched accuracy.
  • It enables explicit control of the marginal false-acceptance risk under the full-information utility reference.
  • It achieves near full-information performance while relying solely on device-side signals before any deferral.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The two-stage separation of device gate and edge selector may simplify updates when edge models change independently of on-device hardware.
  • Similar margin-gate plus conformal calibration patterns could extend to other deferral tasks in edge computing if the utility reference remains stable.
  • Lowering cost at fixed accuracy may increase the number of queries that can run safely on battery-constrained devices without edge fallback.

Load-bearing premise

The full-information utility reference used for CRC calibration accurately represents real deployment conditions and the margin gate generalizes across queries and operating points without significant distribution shift.

What would settle it

Measure the realized false-acceptance rate and actual deployment cost in live wireless tests against the calibrated bounds and check whether the 16.9% cost reduction at matched accuracy still holds.
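The risk half of that check is simple accounting. Under one plausible reading of the paper's marginal false-acceptance risk (an expectation over all queries, not just accepted ones), the realized rate to compare against the calibrated target is:

```python
def false_acceptance_rate(accepted, locally_optimal):
    """Fraction of all queries accepted for local execution that were
    not in fact utility-optimal locally. One plausible reading of the
    paper's 'marginal false-acceptance risk'; the exact loss definition
    is an assumption here.

    accepted: per-query booleans, True if the gate kept the query local.
    locally_optimal: per-query booleans from the full-information
        reference, True if local execution was utility-optimal.
    """
    assert len(accepted) == len(locally_optimal)
    errors = sum(a and not o for a, o in zip(accepted, locally_optimal))
    return errors / len(accepted)
```

In a live test, this rate measured over wireless traffic should stay at or below the CRC target level for the bound to be considered validated.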

Figures

Figures reproduced from arXiv: 2605.12001 by Jiangchao Yao, Meixia Tao, Nan Xue, Shengkang Chen, Yaping Sun, Zhiyong Chen, Zixia Hu.

Figure 1. Two-tier device-edge inference and routing flow.
Figure 2. Overview of CR² including offline training, CRC-based calibration, and online two-stage routing.
Figure 3. Accuracy–cost Pareto curves: (a) full range and (b) zoomed operating
Figure 4. Fixed-accuracy cost comparison.
Figure 6. Marginal false-acceptance rate under CRC-calibrated thresholds.
Figure 7. Local-model selection rate.

Table II. Per-benchmark accuracy at representative cost targets; blocks use the nearest reachable operating point; "–" denotes an unreachable target; Avg. is over all test queries.

  c̄ = 0.35
  Method              MMLU    BBH     GPQA    MBPP    Avg
  MLP                 0.762   0.850   0.624   0.718   0.738
  KNN                 0.806   0.872   0.518   0.821   0.754
  EmbedLLM            0.799   0.857   0.533   0.821   0.752
  LLMRank             0.802   0.868   0.515   0.769   0.738
  CR² (device-edge)   0.…
Figure 9. Gate-error decomposition.
read the original abstract

As large language models (LLMs) move from centralized clouds to mobile edge environments, efficient serving must balance latency, energy consumption, and accuracy under constrained device-edge resources. Query-level routing between lightweight on-device models and stronger edge models provides a flexible mechanism to navigate this trade-off. However, existing routers are designed for centralized cloud settings and optimize token-level costs, failing to capture the dynamic latency and energy overheads in wireless edge deployments. In this paper, we formulate mobile edge LLM routing as a deployment-constrained, cost-aware decision problem, and propose CR^2, a two-stage device-edge routing framework. CR^2 decouples a lightweight on-device margin gate from an edge-side utility selector for deferred queries. The margin gate operates on frozen query embeddings and a user-specified cost weight to predict whether local execution is utility-optimal relative to the best edge alternative under the target operating point. We further introduce a conformal risk control (CRC) calibration procedure that maps each operating point to an acceptance threshold, enabling explicit control of the marginal false-acceptance risk under the full-information utility reference. Experiments on the routing task show that CR^2 closely matches a full-information reference router using only device-side signals before deferral. Compared with strong query-level baselines, CR^2 consistently improves the deployable accuracy-cost Pareto frontier and reduces normalized deployment cost by up to 16.9% at matched accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CR², a two-stage device-edge routing framework for wireless LLM inference. A lightweight on-device margin gate operates on frozen query embeddings and a user-specified cost weight to decide local execution versus deferral; an edge-side utility selector handles deferred queries. Conformal risk control (CRC) is used to calibrate acceptance thresholds against a full-information utility reference, with the goal of controlling marginal false-acceptance risk. Experiments claim that CR² closely matches the full-information reference router while using only device-side signals, improves the deployable accuracy-cost Pareto frontier over strong baselines, and reduces normalized deployment cost by up to 16.9% at matched accuracy.

Significance. If the empirical claims hold under realistic wireless conditions, the work provides a practical mechanism for cost-aware routing with explicit risk guarantees via CRC. The decoupling of the device-side gate from the edge selector and the use of CRC for tunable operating points are strengths that could aid deployment under resource constraints. The paper supplies falsifiable predictions through its Pareto-frontier comparisons and cost-reduction figures.

major comments (2)
  1. [CRC calibration procedure] CRC calibration procedure: the procedure maps device-side margin-gate outputs to thresholds using a full-information utility reference that incorporates both device and edge model outcomes plus exact costs. The manuscript provides no sensitivity analysis or experiments incorporating stochastic wireless channel traces (variable transmission latency or energy draw), which risks violating the exchangeability assumption needed for the marginal coverage guarantee to transfer to real deployments. This directly underpins the central claim that CR² 'closely matches' the reference router.
  2. [Experimental results] Experimental results: the headline 16.9% normalized cost reduction and Pareto-frontier improvements are reported without error bars, confidence intervals, query-split details, or statistical significance tests. It is therefore impossible to determine whether the gains are robust or could arise from post-hoc threshold selection.
minor comments (2)
  1. [Abstract] The abstract refers to 'strong query-level baselines' without naming them; explicit identification would allow readers to assess the strength of the comparison.
  2. [Methods] Notation for the margin gate, utility selector, and CRC threshold mapping would be clearer if introduced with explicit equations at the start of the methods section.
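For context on the referee's exchangeability concern, the generic conformal-risk-control recipe the paper builds on ([14]) is easy to state: pick the smallest acceptance threshold whose inflated empirical risk on an exchangeable calibration set stays below the target level. The sketch below assumes a 0/1 false-acceptance loss (accept with margin ≥ λ but not locally optimal) that is monotone non-increasing in λ and bounded by B = 1; the calibration interface is illustrative, not the paper's exact procedure.

```python
def crc_threshold(calib, alpha, candidates, B=1.0):
    """Conformal-risk-control threshold selection (generic recipe).

    calib: list of (margin, locally_optimal) pairs from a held-out
        calibration set, assumed exchangeable with test queries.
    alpha: target marginal false-acceptance risk level.
    candidates: candidate thresholds; risk is non-increasing in the
        threshold, so we return the smallest one that certifies alpha.
    B: upper bound on the per-query loss (1 for 0/1 loss).
    """
    n = len(calib)
    for lam in sorted(candidates):
        # empirical risk if every query with margin >= lam runs locally
        risk = sum(1 for m, ok in calib if m >= lam and not ok) / n
        # CRC inflation term guarantees E[loss] <= alpha on fresh data
        if (n / (n + 1)) * risk + B / (n + 1) <= alpha:
            return lam
    return max(candidates)  # fall back to the most conservative threshold
```

The guarantee is marginal and rests on exchangeability between calibration and deployment queries, which is exactly what stochastic channel variation could break.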

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and agree to incorporate additional analysis and statistical reporting to strengthen the manuscript.

read point-by-point responses
  1. Referee: [CRC calibration procedure] CRC calibration procedure: the procedure maps device-side margin-gate outputs to thresholds using a full-information utility reference that incorporates both device and edge model outcomes plus exact costs. The manuscript provides no sensitivity analysis or experiments incorporating stochastic wireless channel traces (variable transmission latency or energy draw), which risks violating the exchangeability assumption needed for the marginal coverage guarantee to transfer to real deployments. This directly underpins the central claim that CR² 'closely matches' the reference router.

    Authors: We appreciate the referee's emphasis on the exchangeability assumption underlying conformal risk control. Our calibration uses a full-information utility reference computed from average device and edge costs under the target operating point, which preserves exchangeability with respect to the query distribution in our experimental setup. We acknowledge that real-world stochastic channel variations (e.g., latency jitter) could affect empirical coverage. In the revision we will add a dedicated sensitivity analysis subsection that injects stochastic wireless traces (Rayleigh fading for transmission energy/latency) into the utility reference and reports the resulting marginal coverage rates. This will directly support the robustness of the 'closely matches' claim under more realistic conditions. revision: yes

  2. Referee: [Experimental results] Experimental results: the headline 16.9% normalized cost reduction and Pareto-frontier improvements are reported without error bars, confidence intervals, query-split details, or statistical significance tests. It is therefore impossible to determine whether the gains are robust or could arise from post-hoc threshold selection.

    Authors: We agree that error bars, confidence intervals, and significance testing are necessary to establish robustness. The reported 16.9% figure is the maximum improvement observed across operating points; the original experiments used fixed train/test splits without repeated sampling. In the revised manuscript we will (i) report mean and standard deviation of cost reduction and Pareto metrics over five random query splits, (ii) add error bars to all Pareto-frontier and cost-reduction plots, and (iii) include paired statistical tests (t-test or Wilcoxon signed-rank) against baselines to confirm that improvements are not attributable to post-hoc threshold choice. revision: yes
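The repeated-split reporting promised here can be checked with a lightweight paired test. With only five splits, an exact sign-flip permutation test on per-split cost differences is a reasonable nonparametric stand-in for the proposed t-test or Wilcoxon test; this is an illustrative analysis sketch, not the paper's procedure.

```python
import itertools


def paired_permutation_pvalue(costs_a, costs_b):
    """Exact two-sided sign-flip permutation test on paired per-split
    costs. With few splits (e.g. five) the 2^n enumeration is cheap.
    Illustrative analysis plan, not taken from the paper.
    """
    diffs = [a - b for a, b in zip(costs_a, costs_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # under H0 each paired difference is symmetric about zero,
    # so every sign assignment is equally likely
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total
```

With five splits the smallest attainable two-sided p-value is 2/32 = 0.0625, which is itself a useful caution about what five splits can and cannot establish.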

Circularity Check

0 steps flagged

No significant circularity; derivation uses external reference and standard CRC.

full rationale

The paper trains a device-side margin gate on frozen embeddings to predict utility-optimality labels derived from a full-information reference (device + edge outcomes + costs). It then applies standard conformal risk control (CRC) on a calibration set to set acceptance thresholds that guarantee marginal false-acceptance risk w.r.t. that same reference. The headline claim of 'closely matches' is an empirical comparison on held-out queries, not a definitional identity or a fitted parameter renamed as prediction. CRC coverage is a known property independent of the paper's fitted values; the reference is external to the device signals used at inference. No self-citation load-bearing step, no ansatz smuggled via prior work, and no reduction of the Pareto improvement to the inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework depends on the validity of conformal risk control for marginal coverage and on the existence of a computable full-information utility reference; no new entities are postulated.

free parameters (2)
  • user-specified cost weight
    Controls the operating point of the margin gate; chosen by the user rather than learned from data.
  • CRC acceptance threshold
    Derived from calibration on the full-information reference; maps each operating point to a risk-controlled decision boundary.
axioms (1)
  • standard math Conformal risk control guarantees marginal coverage under exchangeability
    Invoked to ensure the false-acceptance risk stays below the target level.
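The axiom is the standard conformal risk control guarantee [14]: for $n$ exchangeable calibration points and a loss $\ell$ that is monotone non-increasing in the threshold $\lambda$ and bounded by $B$, choosing

```latex
\hat{\lambda} \;=\; \inf\Big\{\lambda :\ \tfrac{n}{n+1}\,\hat{R}_n(\lambda) + \tfrac{B}{n+1} \le \alpha\Big\}
\quad\Longrightarrow\quad
\mathbb{E}\big[\ell(\hat{\lambda})\big] \le \alpha,
```

where $\hat{R}_n$ is the empirical risk on the calibration set. This marginal bound, applied to the false-acceptance loss, is what the ledger invokes.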

pith-pipeline@v0.9.0 · 5578 in / 1358 out tokens · 63489 ms · 2026-05-13T04:51:17.793095+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 4 internal anchors

  1. [1]

    GPT-4 Technical Report

    J. Achiam et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Mobile edge intelligence for large language models: A contemporary survey

    G. Qu, Q. Chen, W. Wei, Z. Lin, X. Chen, and K. Huang, “Mobile edge intelligence for large language models: A contemporary survey,” IEEE Commun. Surveys Tuts., vol. 27, no. 6, pp. 3820–3860, 2025

  3. [3]

    Qwen3 Technical Report

    A. Yang et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

  4. [4]

    Ollama

    Ollama, “Ollama,” GitHub, 2026, accessed: May 11, 2026. [Online]. Available: https://github.com/ollama/ollama

  5. [5]

    EdgeShard: Efficient LLM inference via collaborative edge computing

    M. Zhang, X. Shen, J. Cao, Z. Cui, and S. Jiang, “EdgeShard: Efficient LLM inference via collaborative edge computing,” IEEE Internet Things J., vol. 12, no. 10, pp. 13119–13131, 2024

  6. [6]

    CE-CoLLM: Efficient and adaptive large language models through cloud-edge collaboration

    H. Jin and Y. Wu, “CE-CoLLM: Efficient and adaptive large language models through cloud-edge collaboration,” in Proc. IEEE Int. Conf. Web Services (ICWS), 2025, pp. 316–323

  7. [7]

    Serving long-context LLMs at the mobile edge: Test-time reinforcement learning-based model caching and inference offloading

    M. Xu, D. Niyato, and C. G. Brinton, “Serving long-context LLMs at the mobile edge: Test-time reinforcement learning-based model caching and inference offloading,” IEEE Trans. Netw., 2026

  8. [8]

    Neurosurgeon: Collaborative intelligence between the cloud and mobile edge

    Y. Kang et al., “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” ACM SIGARCH Comput. Archit. News, vol. 45, no. 1, pp. 615–629, 2017

  9. [9]

    SPINN: Synergistic progressive inference of neural networks over device and cloud

    S. Laskaridis, S. I. Venieris, M. Almeida, I. Leontiadis, and N. D. Lane, “SPINN: Synergistic progressive inference of neural networks over device and cloud,” in Proc. ACM MobiCom, 2020, pp. 1–15

  10. [10]

    Edge intelligence: On-demand deep learning model co-inference with device-edge synergy

    E. Li, Z. Zhou, and X. Chen, “Edge intelligence: On-demand deep learning model co-inference with device-edge synergy,” in Proc. ACM SIGCOMM Workshop Mobile Edge Commun., 2018, pp. 31–36

  11. [11]

    Fast inference from transformers via speculative decoding

    Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from transformers via speculative decoding,” in Proc. Int. Conf. Mach. Learn. (ICML), 2023, pp. 19274–19286

  12. [12]

    Sequoia: Scalable, robust, and hardware-aware speculative decoding

    Z. Chen et al., “Sequoia: Scalable, robust, and hardware-aware speculative decoding,” arXiv preprint arXiv:2402.12374, 2024

  13. [13]

    EAGLE: Speculative sampling requires rethinking feature uncertainty

    Y. Li, F. Wei, C. Zhang, and H. Zhang, “EAGLE: Speculative sampling requires rethinking feature uncertainty,” in Proc. Int. Conf. Mach. Learn. (ICML), 2024, pp. 28935–28948

  14. [14]

    Conformal Risk Control

    A. N. Angelopoulos, S. Bates, A. Fisch, L. Lei, and T. Schuster, “Conformal Risk Control,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2024

  15. [15]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” J. Mach. Learn. Res., vol. 23, no. 120, pp. 1–39, 2022

  16. [16]

    Mixture-of-experts with expert choice routing

    Y. Zhou et al., “Mixture-of-experts with expert choice routing,” Adv. Neural Inf. Process. Syst., vol. 35, pp. 7103–7114, 2022

  17. [17]

    R2-T2: Re-routing in test-time for multimodal mixture-of-experts

    Z. Li, Z. Li, and T. Zhou, “R2-T2: Re-routing in test-time for multimodal mixture-of-experts,” in Proc. Int. Conf. Mach. Learn. (ICML), 2025, pp. 35292–35316

  18. [18]

    Routing to the expert: Efficient reward-guided ensemble of large language models

    K. Lu et al., “Routing to the expert: Efficient reward-guided ensemble of large language models,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol. (NAACL-HLT), 2024, pp. 1964–1974

  19. [19]

    TriSpec: Ternary speculative decoding via lightweight proxy verification

    H. Jiang et al., “TriSpec: Ternary speculative decoding via lightweight proxy verification,” arXiv preprint arXiv:2601.23180, 2026

  20. [20]

    RouteLLM: Learning to route LLMs from preference data

    I. Ong et al., “RouteLLM: Learning to route LLMs from preference data,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2025

  21. [21]

    Hybrid LLM: Cost-efficient and quality-aware query routing

    D. Ding et al., “Hybrid LLM: Cost-efficient and quality-aware query routing,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2024

  22. [22]

    GraphRouter: A graph-based router for LLM selections

    T. Feng, Y. Shen, and J. You, “GraphRouter: A graph-based router for LLM selections,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2025

  23. [23]

    TensorOpera router: A multi-model router for efficient LLM inference

    D. Stripelis et al., “TensorOpera router: A multi-model router for efficient LLM inference,” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Industry Track, 2024, pp. 452–462

  24. [24]

    BEST-Route: Adaptive LLM routing with test-time optimal compute

    D. Ding et al., “BEST-Route: Adaptive LLM routing with test-time optimal compute,” in Proc. Int. Conf. Mach. Learn. (ICML), 2025, pp. 13870–13884

  25. [25]

    Dynamic quality-latency aware routing for LLM inference in wireless edge-device networks

    R. Bao, N. Xue, Y. Sun, and Z. Chen, “Dynamic quality-latency aware routing for LLM inference in wireless edge-device networks,” in Proc. IEEE/CIC Int. Conf. Commun. China Workshops (ICCC Workshops), 2025, pp. 1–6

  26. [26]

    Smoothie: Label free language model routing

    N. Guha, M. F. Chen, T. Chow, I. S. Khare, and C. Re, “Smoothie: Label free language model routing,” Adv. Neural Inf. Process. Syst., vol. 37, pp. 127645–127672, 2024

  27. [27]

    Capability instruction tuning

    Y.-K. Zhang, D.-C. Zhan, and H.-J. Ye, “Capability instruction tuning,” in Proc. AAAI Conf. Artif. Intell., vol. 39, no. 24, 2025, pp. 25958–25966

  28. [28]

    FrugalGPT: How to use large language models while reducing cost and improving performance

    L. Chen, M. Zaharia, and J. Zou, “FrugalGPT: How to use large language models while reducing cost and improving performance,” Trans. Mach. Learn. Res., 2024

  29. [29]

    Quality-of-service aware LLM routing for edge computing with multiple experts

    J. Yang, Q. Wu, Z. Feng, Z. Zhou, D. Guo, and X. Chen, “Quality-of-service aware LLM routing for edge computing with multiple experts,” IEEE Trans. Mobile Comput., vol. 24, no. 12, pp. 13648–13662, 2025

  30. [30]

    EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference

    T. Tambe et al., “EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference,” in Proc. IEEE/ACM Int. Symp. Microarchitecture (MICRO), 2021, pp. 830–844

  31. [31]

    SlimCaching: Edge caching of mixture-of-experts for distributed inference

    Q. Chen, X. Chen, and K. Huang, “SlimCaching: Edge caching of mixture-of-experts for distributed inference,” IEEE Trans. Mobile Comput., pp. 1–15, 2026

  32. [32]

    WDMoE: Wireless distributed mixture of experts for large language models

    N. Xue et al., “WDMoE: Wireless distributed mixture of experts for large language models,” IEEE Trans. Wireless Commun., 2025

  33. [33]

    Stable-MoE: Lyapunov-based token routing for distributed mixture-of-experts training over edge networks

    L. Shi, B. Ou, K. Wei, W. Zhu, Z. Wang, and Z. Chen, “Stable-MoE: Lyapunov-based token routing for distributed mixture-of-experts training over edge networks,” IEEE Trans. Veh. Technol., 2026

  34. [34]

    CSGO: Generalized optimization for cold start in wireless collaborative edge LLM systems

    X. Liu et al., “CSGO: Generalized optimization for cold start in wireless collaborative edge LLM systems,” J. Commun. Inf. Netw., vol. 10, no. 4, pp. 340–351, 2025

  35. [35]

    WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference

    Z. Liu et al., “WISV: Wireless-informed semantic verification for distributed speculative decoding in device-edge LLM inference,” arXiv preprint arXiv:2604.17701, 2026

  36. [36]

    MoE 2: Optimizing collaborative inference for edge large language models

    L. Jin et al., “MoE 2: Optimizing collaborative inference for edge large language models,” IEEE Trans. Netw., vol. 34, pp. 4637–4651, 2026

  37. [37]

    Large language model-empowered resource allocation in intent-driven wireless networks

    H. Sun et al., “Large language model-empowered resource allocation in intent-driven wireless networks,” IEEE Trans. Cogn. Commun. Netw., vol. 12, pp. 6265–6280, 2026

  38. [38]

    EmbedLLM: Learning compact representations of large language models

    R. Zhuang, T. Wu, Z. Wen, A. Li, J. Jiao, and K. Ramchandran, “EmbedLLM: Learning compact representations of large language models,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2025

  39. [39]

    RouterDC: Query-based router by dual contrastive learning for assembling large language models

    S. Chen, W. Jiang, B. Lin, J. Kwok, and Y. Zhang, “RouterDC: Query-based router by dual contrastive learning for assembling large language models,” Adv. Neural Inf. Process. Syst., vol. 37, pp. 66305–66328, 2024

  40. [40]

    RadialRouter: Structured representation for efficient and robust large language models routing

    R. Jin et al., “RadialRouter: Structured representation for efficient and robust large language models routing,” in Findings Assoc. Comput. Linguistics: EMNLP, 2025, pp. 14587–14600

  41. [41]

    Algorithmic Learning in a Random World

    V. Vovk, A. Gammerman, and G. Shafer, Algorithmic Learning in a Random World. Springer, 2005

  42. [42]

    Learn then test: Calibrating predictive algorithms to achieve risk control

    A. N. Angelopoulos, S. Bates, E. J. Candès, M. I. Jordan, and L. Lei, “Learn then test: Calibrating predictive algorithms to achieve risk control,” Ann. Appl. Stat., vol. 19, no. 2, pp. 1641–1662, 2025

  43. [43]

    Measuring massive multitask language understanding

    D. Hendrycks et al., “Measuring massive multitask language understanding,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2021

  44. [44]

    Challenging BIG-Bench tasks and whether chain-of-thought can solve them

    M. Suzgun et al., “Challenging BIG-Bench tasks and whether chain-of-thought can solve them,” in Findings Assoc. Comput. Linguistics: ACL, 2023, pp. 13003–13051

  45. [45]

    GPQA: A graduate-level Google-proof Q&A benchmark

    D. Rein et al., “GPQA: A graduate-level Google-proof Q&A benchmark,” in Proc. First Conf. Lang. Model. (COLM), 2024

  46. [46]

    Program Synthesis with Large Language Models

    J. Austin et al., “Program synthesis with large language models,” arXiv preprint arXiv:2108.07732, 2021

  47. [47]

    The Language Model Evaluation Harness

    L. Gao et al., “The Language Model Evaluation Harness,” Jul. 2024. [Online]. Available: https://zenodo.org/records/12608602

  48. [48]

    RouterBench: A benchmark for multi-LLM routing system

    Q. J. Hu et al., “RouterBench: A benchmark for multi-LLM routing system,” arXiv preprint arXiv:2403.12031, 2024

  49. [49]

    LLMRank: Understanding LLM strengths for model routing

    S. Agrawal and P. Gupta, “LLMRank: Understanding LLM strengths for model routing,” arXiv preprint arXiv:2510.01234, 2025