Recognition: 2 theorem links · Lean Theorem
CR²: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference
Pith reviewed 2026-05-13 04:51 UTC · model grok-4.3
The pith
CR² routes LLM queries between device and edge to match full-information accuracy at lower cost using only local signals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CR² decouples a lightweight on-device margin gate from an edge-side utility selector for deferred queries. The margin gate operates on frozen query embeddings and a user-specified cost weight to predict whether local execution is utility-optimal relative to the best edge alternative. A conformal risk control (CRC) calibration procedure maps each operating point to an acceptance threshold, enabling explicit control of the marginal false-acceptance risk under the full-information utility reference. Experiments show that CR² closely matches a full-information reference router using only device-side signals before deferral and reduces normalized deployment cost by up to 16.9% at matched accuracy.
What carries the argument
The margin gate, which predicts local execution optimality relative to edge alternatives based on cost weight and frozen embeddings, combined with CRC calibration for risk-controlled thresholds.
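A minimal sketch of that two-stage decision, assuming a linear margin gate; every function name, parameter, and the linear form itself are illustrative assumptions, not the paper's implementation:

```python
def margin_gate(weights, bias, embedding, cost_weight):
    # Linear scorer over the frozen query embedding plus the user-specified
    # cost weight. The linear form is an assumption; the paper only
    # specifies a lightweight on-device gate.
    features = list(embedding) + [cost_weight]
    return sum(w * f for w, f in zip(weights, features)) + bias

def route(embedding, cost_weight, threshold, weights, bias, edge_selector):
    # Stage 1 (device): accept local execution if the predicted utility
    # margin clears the CRC-calibrated acceptance threshold.
    # Stage 2 (edge): otherwise defer, and let the edge-side utility
    # selector choose among edge models.
    margin = margin_gate(weights, bias, embedding, cost_weight)
    if margin >= threshold:
        return "device"
    return edge_selector(embedding, cost_weight)

# Toy usage with hand-picked parameters (illustrative only).
weights, bias = [0.5, -0.2, 0.1], 0.0
easy = route([1.0, 0.0], 0.3, 0.2, weights, bias, lambda e, lam: "edge_model_A")
hard = route([-1.0, 0.5], 0.3, 0.2, weights, bias, lambda e, lam: "edge_model_A")
print(easy, hard)  # → device edge_model_A
```

The point of the decoupling is that only `margin_gate` runs on-device; the edge selector never sees queries the gate accepts locally.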
If this is right
- CR² consistently improves the deployable accuracy-cost Pareto frontier compared with strong query-level baselines.
- It reduces normalized deployment cost by up to 16.9% at matched accuracy.
- It enables explicit control of the marginal false-acceptance risk under the full-information utility reference.
- It achieves near full-information performance while relying solely on device-side signals before any deferral.
Where Pith is reading between the lines
- The two-stage separation of device gate and edge selector may simplify updates when edge models change independently of on-device hardware.
- Similar margin-gate plus conformal calibration patterns could extend to other deferral tasks in edge computing if the utility reference remains stable.
- Lowering cost at fixed accuracy may increase the number of queries that can run safely on battery-constrained devices without edge fallback.
Load-bearing premise
The full-information utility reference used for CRC calibration accurately represents real deployment conditions, and the margin gate generalizes across queries and operating points without significant distribution shift.
What would settle it
Measure the realized false-acceptance rate and actual deployment cost in live wireless tests against the calibrated bounds and check whether the 16.9% cost reduction at matched accuracy still holds.
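A minimal sketch of the false-acceptance half of that check, assuming per-query full-information optimality labels are available on the held-out trace (all names and numbers hypothetical):

```python
def realized_false_acceptance(accepted_locally, locally_optimal):
    # Fraction of queries the gate accepted locally for which local
    # execution was NOT utility-optimal under the full-information
    # reference -- the quantity CRC calibration is meant to bound.
    accepted = [opt for acc, opt in zip(accepted_locally, locally_optimal) if acc]
    if not accepted:
        return 0.0
    return sum(1 for opt in accepted if not opt) / len(accepted)

# Toy held-out trace: 4 accepted queries, 1 of them suboptimal locally.
acc = [True, True, False, True, True]
opt = [True, False, True, True, True]
risk = realized_false_acceptance(acc, opt)
print(risk)                 # → 0.25
print(risk <= 0.3)          # compare against a calibrated bound alpha = 0.3
```

In a live test the labels would come from running both device and edge models on the trace and scoring them with measured latency/energy costs.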
Original abstract
As large language models (LLMs) move from centralized clouds to mobile edge environments, efficient serving must balance latency, energy consumption, and accuracy under constrained device-edge resources. Query-level routing between lightweight on-device models and stronger edge models provides a flexible mechanism to navigate this trade-off. However, existing routers are designed for centralized cloud settings and optimize token-level costs, failing to capture the dynamic latency and energy overheads in wireless edge deployments. In this paper, we formulate mobile edge LLM routing as a deployment-constrained, cost-aware decision problem, and propose CR^2, a two-stage device-edge routing framework. CR^2 decouples a lightweight on-device margin gate from an edge-side utility selector for deferred queries. The margin gate operates on frozen query embeddings and a user-specified cost weight to predict whether local execution is utility-optimal relative to the best edge alternative under the target operating point. We further introduce a conformal risk control (CRC) calibration procedure that maps each operating point to an acceptance threshold, enabling explicit control of the marginal false-acceptance risk under the full-information utility reference. Experiments on the routing task show that CR^2 closely matches a full-information reference router using only device-side signals before deferral. Compared with strong query-level baselines, CR^2 consistently improves the deployable accuracy-cost Pareto frontier and reduces normalized deployment cost by up to 16.9% at matched accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CR², a two-stage device-edge routing framework for wireless LLM inference. A lightweight on-device margin gate operates on frozen query embeddings and a user-specified cost weight to decide local execution versus deferral; an edge-side utility selector handles deferred queries. Conformal risk control (CRC) is used to calibrate acceptance thresholds against a full-information utility reference, with the goal of controlling marginal false-acceptance risk. Experiments claim that CR² closely matches the full-information reference router while using only device-side signals, improves the deployable accuracy-cost Pareto frontier over strong baselines, and reduces normalized deployment cost by up to 16.9% at matched accuracy.
Significance. If the empirical claims hold under realistic wireless conditions, the work provides a practical mechanism for cost-aware routing with explicit risk guarantees via CRC. The decoupling of the device-side gate from the edge selector and the use of CRC for tunable operating points are strengths that could aid deployment under resource constraints. The paper supplies falsifiable predictions through its Pareto-frontier comparisons and cost-reduction figures.
major comments (2)
- [CRC calibration procedure] CRC calibration procedure: the procedure maps device-side margin-gate outputs to thresholds using a full-information utility reference that incorporates both device and edge model outcomes plus exact costs. The manuscript provides no sensitivity analysis or experiments incorporating stochastic wireless channel traces (variable transmission latency or energy draw), which risks violating the exchangeability assumption needed for the marginal coverage guarantee to transfer to real deployments. This directly underpins the central claim that CR² 'closely matches' the reference router.
- [Experimental results] Experimental results: the headline 16.9% normalized cost reduction and Pareto-frontier improvements are reported without error bars, confidence intervals, query-split details, or statistical significance tests. It is therefore impossible to determine whether the gains are robust or could arise from post-hoc threshold selection.
minor comments (2)
- [Abstract] The abstract refers to 'strong query-level baselines' without naming them; explicit identification would allow readers to assess the strength of the comparison.
- [Methods] Notation for the margin gate, utility selector, and CRC threshold mapping would be clearer if introduced with explicit equations at the start of the methods section.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and agree to incorporate additional analysis and statistical reporting to strengthen the manuscript.
Point-by-point responses
- Referee: [CRC calibration procedure] CRC calibration procedure: the procedure maps device-side margin-gate outputs to thresholds using a full-information utility reference that incorporates both device and edge model outcomes plus exact costs. The manuscript provides no sensitivity analysis or experiments incorporating stochastic wireless channel traces (variable transmission latency or energy draw), which risks violating the exchangeability assumption needed for the marginal coverage guarantee to transfer to real deployments. This directly underpins the central claim that CR² 'closely matches' the reference router.
Authors: We appreciate the referee's emphasis on the exchangeability assumption underlying conformal risk control. Our calibration uses a full-information utility reference computed from average device and edge costs under the target operating point, which preserves exchangeability with respect to the query distribution in our experimental setup. We acknowledge that real-world stochastic channel variations (e.g., latency jitter) could affect empirical coverage. In the revision we will add a dedicated sensitivity analysis subsection that injects stochastic wireless traces (Rayleigh fading for transmission energy/latency) into the utility reference and reports the resulting marginal coverage rates. This will directly support the robustness of the 'closely matches' claim under more realistic conditions. revision: yes
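As a starting point, the proposed Rayleigh-fading injection could be sketched like this; the scale parameter, the cost model, and all names are placeholder assumptions rather than the authors' planned protocol:

```python
import random

def rayleigh_transmission_cost(base_cost, sigma, rng):
    # Sample a Rayleigh-distributed channel gain and scale the nominal
    # transmission cost by its inverse: deep fades make deferral dearer.
    # Rayleigh(sigma) equals a Weibull with shape 2 and scale sigma*sqrt(2).
    gain = rng.weibullvariate(sigma * (2 ** 0.5), 2)
    return base_cost / max(gain, 1e-3)

rng = random.Random(0)
samples = [rayleigh_transmission_cost(1.0, sigma=1.0, rng=rng) for _ in range(1000)]
print(all(s > 0 for s in samples))  # → True
```

Replaying the CRC calibration against such perturbed costs, then measuring the realized coverage, is one way to probe the exchangeability concern.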
- Referee: [Experimental results] Experimental results: the headline 16.9% normalized cost reduction and Pareto-frontier improvements are reported without error bars, confidence intervals, query-split details, or statistical significance tests. It is therefore impossible to determine whether the gains are robust or could arise from post-hoc threshold selection.
Authors: We agree that error bars, confidence intervals, and significance testing are necessary to establish robustness. The reported 16.9% figure is the maximum improvement observed across operating points; the original experiments used fixed train/test splits without repeated sampling. In the revised manuscript we will (i) report mean and standard deviation of cost reduction and Pareto metrics over five random query splits, (ii) add error bars to all Pareto-frontier and cost-reduction plots, and (iii) include paired statistical tests (t-test or Wilcoxon signed-rank) against baselines to confirm that improvements are not attributable to post-hoc threshold choice. revision: yes
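Items (i)–(iii) could be prototyped as follows; the per-split costs below are placeholders, and a full analysis would report p-values via a library implementation (e.g., a paired t-test or Wilcoxon signed-rank from scipy.stats):

```python
import math
import statistics

def paired_t_statistic(costs_cr2, costs_baseline):
    # Paired t-statistic on per-split cost differences (baseline - CR2);
    # positive values favor CR2. Uses the sample standard deviation of
    # the paired differences, as in the standard paired t-test.
    diffs = [b - c for c, b in zip(costs_cr2, costs_baseline)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return mean / (sd / math.sqrt(len(diffs)))

# Placeholder normalized costs over five random query splits.
cr2      = [0.83, 0.85, 0.84, 0.82, 0.86]
baseline = [1.00, 0.99, 1.01, 1.00, 0.98]

print(round(statistics.mean(cr2), 3), round(statistics.stdev(cr2), 4))
t = paired_t_statistic(cr2, baseline)
print(t > 0)  # → True
```

Reporting the mean, standard deviation, and paired statistic per operating point would directly address the post-hoc-selection concern.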
Circularity Check
No significant circularity; derivation uses external reference and standard CRC.
Full rationale
The paper trains a device-side margin gate on frozen embeddings to predict utility-optimality labels derived from a full-information reference (device + edge outcomes + costs). It then applies standard conformal risk control (CRC) on a calibration set to set acceptance thresholds that guarantee marginal false-acceptance risk w.r.t. that same reference. The headline claim of 'closely matches' is an empirical comparison on held-out queries, not a definitional identity or a fitted parameter renamed as prediction. CRC coverage is a known property independent of the paper's fitted values; the reference is external to the device signals used at inference. No self-citation load-bearing step, no ansatz smuggled via prior work, and no reduction of the Pareto improvement to the inputs by construction.
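The threshold-setting step described above can be sketched with the generic CRC recipe: pick the smallest acceptance threshold whose bound-adjusted empirical risk on the calibration set stays below the target level α. The loss encoding and variable names are illustrative, not the paper's code:

```python
def crc_threshold(margins, losses, alpha, bound=1.0):
    # margins: calibration-set margin-gate scores.
    # losses[i]: 1 if accepting query i locally violates the
    # full-information utility reference, else 0.
    # Returns the smallest threshold tau satisfying the CRC condition
    # (n * R_hat(tau) + B) / (n + 1) <= alpha, where B bounds the loss.
    n = len(margins)
    for tau in sorted(margins) + [float("inf")]:
        r_hat = sum(l for m, l in zip(margins, losses) if m >= tau) / n
        if (n * r_hat + bound) / (n + 1) <= alpha:
            return tau
    return float("inf")

# Toy calibration set: high margins are mostly safe to accept locally.
margins = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
losses  = [0,   0,   0,   0,   0,   1,   0,   1,   1,   1]
tau = crc_threshold(margins, losses, alpha=0.25)
print(tau)  # → 0.3
```

Because the risk is monotone in the threshold, this search is what makes each operating point map to a single calibrated acceptance threshold.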
Axiom & Free-Parameter Ledger
free parameters (2)
- user-specified cost weight
- CRC acceptance threshold
axioms (1)
- (standard math) Conformal risk control guarantees marginal coverage under exchangeability.
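Stated in the generic CRC notation (the symbols here are the standard conformal-risk-control ones, not taken from the paper), the axiom reads:

```latex
% Under exchangeability of the n calibration queries and a test query,
% the CRC-calibrated acceptance threshold \hat{\tau} controls the expected
% loss -- here, the false-acceptance indicator under the full-information
% utility reference -- at the user-chosen level \alpha:
\mathbb{E}\left[\, \ell\big(X_{n+1}, \hat{\tau}\big) \,\right] \le \alpha
```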
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · "We formulate mobile edge LLM routing as a deployment-constrained, cost-aware decision problem... conformal risk control (CRC) calibration procedure that maps each operating point to an acceptance threshold"
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear · "the scalarized utility u_m(x, ξ; λ) = y_m(x) − λ c_m(x, ξ)"
Reference graph
Works this paper leans on
- [1] J. Achiam et al., "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
- [2] G. Qu, Q. Chen, W. Wei, Z. Lin, X. Chen, and K. Huang, "Mobile edge intelligence for large language models: A contemporary survey," IEEE Commun. Surveys Tuts., vol. 27, no. 6, pp. 3820–3860, 2025.
- [3] A. Yang et al., "Qwen3 technical report," arXiv preprint arXiv:2505.09388, 2025.
- [4]
- [5] M. Zhang, X. Shen, J. Cao, Z. Cui, and S. Jiang, "EdgeShard: Efficient LLM inference via collaborative edge computing," IEEE Internet Things J., vol. 12, no. 10, pp. 13119–13131, 2024.
- [6] H. Jin and Y. Wu, "CE-CoLLM: Efficient and adaptive large language models through cloud-edge collaboration," in Proc. IEEE Int. Conf. Web Services (ICWS), 2025, pp. 316–323.
- [7] M. Xu, D. Niyato, and C. G. Brinton, "Serving long-context LLMs at the mobile edge: Test-time reinforcement learning-based model caching and inference offloading," IEEE Trans. Netw., 2026.
- [8] Y. Kang et al., "Neurosurgeon: Collaborative intelligence between the cloud and mobile edge," ACM SIGARCH Comput. Archit. News, vol. 45, no. 1, pp. 615–629, 2017.
- [9] S. Laskaridis, S. I. Venieris, M. Almeida, I. Leontiadis, and N. D. Lane, "SPINN: Synergistic progressive inference of neural networks over device and cloud," in Proc. ACM MobiCom, 2020, pp. 1–15.
- [10] E. Li, Z. Zhou, and X. Chen, "Edge intelligence: On-demand deep learning model co-inference with device-edge synergy," in Proc. ACM SIGCOMM Workshop Mobile Edge Commun., 2018, pp. 31–36.
- [11] Y. Leviathan, M. Kalman, and Y. Matias, "Fast inference from transformers via speculative decoding," in Proc. Int. Conf. Mach. Learn. (ICML), 2023, pp. 19274–19286.
- [12] Z. Chen et al., "Sequoia: Scalable, robust, and hardware-aware speculative decoding," arXiv preprint arXiv:2402.12374, 2024.
- [13] Y. Li, F. Wei, C. Zhang, and H. Zhang, "EAGLE: Speculative sampling requires rethinking feature uncertainty," in Proc. Int. Conf. Mach. Learn. (ICML), 2024, pp. 28935–28948.
- [14] A. N. Angelopoulos, S. Bates, A. Fisch, L. Lei, and T. Schuster, "Conformal risk control," in Proc. Int. Conf. Learn. Represent. (ICLR), 2024.
- [15] W. Fedus, B. Zoph, and N. Shazeer, "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity," J. Mach. Learn. Res., vol. 23, no. 120, pp. 1–39, 2022.
- [16] Y. Zhou et al., "Mixture-of-experts with expert choice routing," Adv. Neural Inf. Process. Syst., vol. 35, pp. 7103–7114, 2022.
- [17] Z. Li, Z. Li, and T. Zhou, "R2-T2: Re-routing in test-time for multimodal mixture-of-experts," in Proc. Int. Conf. Mach. Learn. (ICML), 2025, pp. 35292–35316.
- [18] K. Lu et al., "Routing to the expert: Efficient reward-guided ensemble of large language models," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol. (NAACL-HLT), 2024, pp. 1964–1974.
- [19] H. Jiang et al., "TriSpec: Ternary speculative decoding via lightweight proxy verification," arXiv preprint arXiv:2601.23180, 2026.
- [20] I. Ong et al., "RouteLLM: Learning to route LLMs from preference data," in Proc. Int. Conf. Learn. Represent. (ICLR), 2025.
- [21] D. Ding et al., "Hybrid LLM: Cost-efficient and quality-aware query routing," in Proc. Int. Conf. Learn. Represent. (ICLR), 2024.
- [22] T. Feng, Y. Shen, and J. You, "GraphRouter: A graph-based router for LLM selections," in Proc. Int. Conf. Learn. Represent. (ICLR), 2025.
- [23] D. Stripelis et al., "TensorOpera router: A multi-model router for efficient LLM inference," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Industry Track, 2024, pp. 452–462.
- [24] D. Ding et al., "BEST-Route: Adaptive LLM routing with test-time optimal compute," in Proc. Int. Conf. Mach. Learn. (ICML), 2025, pp. 13870–13884.
- [25] R. Bao, N. Xue, Y. Sun, and Z. Chen, "Dynamic quality-latency aware routing for LLM inference in wireless edge-device networks," in Proc. IEEE/CIC Int. Conf. Commun. China Workshops (ICCC Workshops), 2025, pp. 1–6.
- [26] N. Guha, M. F. Chen, T. Chow, I. S. Khare, and C. Ré, "Smoothie: Label free language model routing," Adv. Neural Inf. Process. Syst., vol. 37, pp. 127645–127672, 2024.
- [27] Y.-K. Zhang, D.-C. Zhan, and H.-J. Ye, "Capability instruction tuning," in Proc. AAAI Conf. Artif. Intell., vol. 39, no. 24, 2025, pp. 25958–25966.
- [28] L. Chen, M. Zaharia, and J. Zou, "FrugalGPT: How to use large language models while reducing cost and improving performance," Trans. Mach. Learn. Res., 2024.
- [29] J. Yang, Q. Wu, Z. Feng, Z. Zhou, D. Guo, and X. Chen, "Quality-of-service aware LLM routing for edge computing with multiple experts," IEEE Trans. Mobile Comput., vol. 24, no. 12, pp. 13648–13662, 2025.
- [30] T. Tambe et al., "EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference," in Proc. IEEE/ACM Int. Symp. Microarchitecture (MICRO), 2021, pp. 830–844.
- [31] Q. Chen, X. Chen, and K. Huang, "SlimCaching: Edge caching of mixture-of-experts for distributed inference," IEEE Trans. Mobile Comput., pp. 1–15, 2026.
- [32] N. Xue et al., "WDMoE: Wireless distributed mixture of experts for large language models," IEEE Trans. Wireless Commun., 2025.
- [33] L. Shi, B. Ou, K. Wei, W. Zhu, Z. Wang, and Z. Chen, "Stable-MoE: Lyapunov-based token routing for distributed mixture-of-experts training over edge networks," IEEE Trans. Veh. Technol., 2026.
- [34] X. Liu et al., "CSGO: Generalized optimization for cold start in wireless collaborative edge LLM systems," J. Commun. Inf. Netw., vol. 10, no. 4, pp. 340–351, 2025.
- [35] Z. Liu et al., "WISV: Wireless-informed semantic verification for distributed speculative decoding in device-edge LLM inference," arXiv preprint arXiv:2604.17701, 2026.
- [36] L. Jin et al., "MoE²: Optimizing collaborative inference for edge large language models," IEEE Trans. Netw., vol. 34, pp. 4637–4651, 2026.
- [37] H. Sun et al., "Large language model-empowered resource allocation in intent-driven wireless networks," IEEE Trans. Cogn. Commun. Netw., vol. 12, pp. 6265–6280, 2026.
- [38] R. Zhuang, T. Wu, Z. Wen, A. Li, J. Jiao, and K. Ramchandran, "EmbedLLM: Learning compact representations of large language models," in Proc. Int. Conf. Learn. Represent. (ICLR), 2025.
- [39] S. Chen, W. Jiang, B. Lin, J. Kwok, and Y. Zhang, "RouterDC: Query-based router by dual contrastive learning for assembling large language models," Adv. Neural Inf. Process. Syst., vol. 37, pp. 66305–66328, 2024.
- [40] R. Jin et al., "RadialRouter: Structured representation for efficient and robust large language models routing," in Findings Assoc. Comput. Linguistics: EMNLP, 2025, pp. 14587–14600.
- [41]
- [42] A. N. Angelopoulos, S. Bates, E. J. Candès, M. I. Jordan, and L. Lei, "Learn then test: Calibrating predictive algorithms to achieve risk control," Ann. Appl. Stat., vol. 19, no. 2, pp. 1641–1662, 2025.
- [43] D. Hendrycks et al., "Measuring massive multitask language understanding," in Proc. Int. Conf. Learn. Represent. (ICLR), 2021.
- [44] M. Suzgun et al., "Challenging BIG-Bench tasks and whether chain-of-thought can solve them," in Findings Assoc. Comput. Linguistics: ACL, 2023, pp. 13003–13051.
- [45] D. Rein et al., "GPQA: A graduate-level Google-proof Q&A benchmark," in Proc. First Conf. Lang. Model. (COLM), 2024.
- [46] J. Austin et al., "Program synthesis with large language models," arXiv preprint arXiv:2108.07732, 2021.
- [47] L. Gao et al., "The Language Model Evaluation Harness," Jul. 2024. [Online]. Available: https://zenodo.org/records/12608602
- [48] Q. J. Hu et al., "RouterBench: A benchmark for multi-LLM routing system," arXiv preprint arXiv:2403.12031, 2024.
- [49] S. Agrawal and P. Gupta, "LLMRank: Understanding LLM strengths for model routing," arXiv preprint arXiv:2510.01234, 2025.
discussion (0)