pith. machine review for the scientific record.

arxiv: 2605.11603 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: no theorem link

GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 01:36 UTC · model grok-4.3

classification 💻 cs.AI
keywords carbon-aware routing · LLM inference · constrained optimization · model selection · energy efficiency · sustainable computing · primal-dual algorithm · service-level objectives

The pith

GAR routes each LLM request to minimize carbon emissions while enforcing accuracy floors and p95 latency bounds across heterogeneous model pools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Green-Aware Routing as a way to choose which model handles a given request from a pool of LLMs ranging from 7B to 70B parameters. It treats carbon emissions as the quantity to minimize and casts accuracy and tail latency as hard constraints that must be met. Lightweight estimators predict those three quantities from request features alone, and an online primal-dual algorithm adjusts the constraints on the fly using per-dataset tuning. If the approach works, inference systems could shift load toward lower-carbon models during high-intensity grid periods without new hardware or extra inference passes.

Core claim

GAR is a constrained multi-objective optimization framework that minimizes per-request CO2 emissions subject to explicit accuracy floors and p95-latency service-level objectives. It employs adaptive constraint optimization through per-dataset floor tuning together with lightweight estimators for correctness, tail latency, and carbon emissions, enabling real-time routing decisions without additional inference passes. GAR-PD provides a practical online primal-dual routing algorithm for rolling carbon budgets, while heuristic variants maintain high feasibility coverage with limited accuracy degradation. Experiments on standard NLP benchmarks demonstrate substantial carbon reductions while maintaining competitive accuracy and p95 latency guarantees.

What carries the argument

The constrained multi-objective optimization that places carbon emissions as the objective and accuracy plus p95 latency as enforceable constraints, solved with per-dataset adaptive tuning and an online primal-dual algorithm.
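The per-request decision this describes can be sketched in a few lines. The following is a hypothetical reconstruction, not the paper's implementation: model names, estimate values, and the graceful-degradation fallback are all illustrative assumptions.

```python
# Hypothetical sketch of GAR-style feasibility routing: among models whose
# estimated accuracy clears the floor and whose estimated latency fits the
# SLO, pick the one with the lowest estimated carbon cost. All names and
# numbers are illustrative, not taken from the paper.
from dataclasses import dataclass

@dataclass
class ModelEstimate:
    name: str
    accuracy: float    # estimated probability of a correct answer
    latency_ms: float  # estimated latency for this request
    carbon_g: float    # estimated grams CO2 for this request

def route(estimates, accuracy_floor, latency_slo_ms):
    """Return the lowest-carbon model satisfying both constraints,
    falling back to the most accurate model if none is feasible."""
    feasible = [m for m in estimates
                if m.accuracy >= accuracy_floor and m.latency_ms <= latency_slo_ms]
    if feasible:
        return min(feasible, key=lambda m: m.carbon_g)
    # infeasible request: degrade gracefully toward best accuracy
    return max(estimates, key=lambda m: m.accuracy)

pool = [
    ModelEstimate("7B",  accuracy=0.78, latency_ms=120, carbon_g=0.4),
    ModelEstimate("13B", accuracy=0.84, latency_ms=210, carbon_g=0.9),
    ModelEstimate("70B", accuracy=0.91, latency_ms=640, carbon_g=4.1),
]
choice = route(pool, accuracy_floor=0.80, latency_slo_ms=500)
print(choice.name)  # "13B": lowest carbon among the feasible models
```

The point of the framing is visible here: carbon is the only quantity ever ranked, while accuracy and latency act purely as filters.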

If this is right

  • Requests can be steered to smaller or more efficient models when grid carbon intensity rises, provided the accuracy floor for that task remains satisfied.
  • Per-dataset tuning lets the same framework adapt accuracy requirements to different benchmarks without manual retuning of every constraint.
  • Rolling carbon budgets become enforceable over time windows rather than single requests.
  • Heuristic approximations offer practical fallbacks that still respect the latency bound when exact optimization is computationally expensive.
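The rolling-budget mechanism behind GAR-PD can be illustrated with a minimal online primal-dual loop. This is a hedged reconstruction in the spirit of the paper's description, not its algorithm: the scoring rule, step size, and per-model estimates are assumptions.

```python
# Minimal online primal-dual sketch (illustrative, not the paper's GAR-PD):
# a dual price on carbon rises when spending runs ahead of a rolling budget,
# steering later requests toward cheaper models. Constants are made up.

def primal_dual_route(requests, models, budget_per_req, step=0.05):
    lam = 0.0          # dual price on carbon overspend
    choices = []
    for est in requests:  # est maps model name -> (accuracy, carbon_g)
        # primal step: maximize estimated accuracy minus priced carbon
        name = max(models, key=lambda m: est[m][0] - lam * est[m][1])
        acc, carbon = est[name]
        choices.append(name)
        # dual step: raise the price if over budget, relax it otherwise
        lam = max(0.0, lam + step * (carbon - budget_per_req))
    return choices

# identical requests; "large" is more accurate but far more carbon-intensive
ests = [{"small": (0.7, 0.5), "large": (0.9, 4.0)}] * 5
print(primal_dual_route(ests, ["small", "large"], budget_per_req=1.0))
# ['large', 'small', 'small', 'small', 'small']
```

After one over-budget choice the dual price makes the large model unattractive, which is exactly the "enforceable over time windows rather than single requests" behavior described above.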

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same constraint-based framing could be applied to other scarce resources such as memory bandwidth or specialized accelerator time.
  • If the estimators generalize, the router could incorporate real-time regional carbon intensity data to prefer models located in cleaner grids.
  • The approach points toward treating sustainability metrics as first-class constraints in any multi-model serving system rather than post-hoc filters.

Load-bearing premise

Lightweight estimators for accuracy, tail latency, and emissions can be trained to stay accurate enough for real-time decisions without extra model runs, and per-dataset floor tuning keeps the constraints feasible across different model sizes.
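This premise is cheap to probe. The sketch below fits a tiny closed-form linear estimator mapping a single request feature (prompt length) to latency; it is a stand-in for the paper's unspecified estimators, using synthetic data.

```python
# Stand-in for a "lightweight estimator": ordinary least squares mapping
# prompt length to latency. The data is synthetic (exactly y = x + 60),
# chosen only to show the fit-then-predict shape of the premise.

def fit_linear(xs, ys):
    """Closed-form OLS for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

tokens  = [50, 100, 200, 400, 800]     # prompt length per training request
latency = [110, 160, 260, 460, 860]    # measured latency in ms
a, b = fit_linear(tokens, latency)
pred = a * 300 + b
print(round(pred))  # 360 ms predicted for a 300-token prompt
```

The open question the premise hides is whether errors of such estimators stay small enough across model sizes and hardware that the feasibility filter does not silently violate the SLO.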

What would settle it

Run the router live on a new dataset with measured carbon intensity traces and check whether measured accuracy drops below the tuned floor or p95 latency exceeds the target on a statistically significant fraction of requests.
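That settling test can be scripted directly from per-request logs: compute the empirical p95 latency and the accuracy rate, then compare both against the tuned targets. The thresholds and trace values below are illustrative.

```python
# Post-hoc feasibility audit over measured per-request logs (synthetic data).

def p95(values):
    """Nearest-rank 95th percentile: smallest value with >= 95% of the
    sample at or below it."""
    s = sorted(values)
    idx = max(0, -(-95 * len(s) // 100) - 1)   # ceil(0.95 * n) - 1
    return s[idx]

latencies = [120, 130, 150, 200, 210, 250, 300, 340, 400, 900]  # ms
correct   = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]   # measured correctness

slo_ms, floor = 500.0, 0.75
print(p95(latencies) <= slo_ms)               # False: the tail violates the SLO
print(sum(correct) / len(correct) >= floor)   # True: accuracy floor holds
```

A run like this on live traces, repeated enough times to bound the violation rate statistically, is what would distinguish a tuned benchmark result from a deployable guarantee.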

Figures

Figures reproduced from arXiv: 2605.11603 by Disha Sheshanarayana, Manjira Sinha, Rajat Subhra Pal, Tirthankar Dasgupta.

Figure 1. Overview of the GAR framework. GAR estimates quality, latency, and carbon emissions. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png]
Figure 2. Baseline performance comparison across model pool and benchmark datasets. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png]
Figure 3. Comprehensive performance analysis of GAR methods. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]
read the original abstract

The growing deployment of large language models (LLMs) makes per-request routing essential for balancing response quality and computational cost across heterogeneous model pools. Current routing methods rarely consider sustainable energy use and CO2 emissions as optimization objectives, despite grid carbon intensity varying by time and region, and models differing significantly in energy consumption. To address this gap, we introduce Green-Aware Routing (GAR), a constrained multi-objective optimization framework that minimizes per-request CO2 emissions subject to explicit accuracy floors and p95-latency service-level objectives (SLOs). GAR employs adaptive constraint optimization through per-dataset floor tuning and incorporates lightweight estimators for correctness, tail latency, and carbon emissions, enabling real-time routing decisions without additional inference passes. We present GAR-PD, a practical online primal-dual routing algorithm for rolling carbon budgets, alongside heuristic variants that achieve high feasibility coverage while limiting accuracy degradation. Comprehensive experiments across standard NLP benchmarks with heterogeneous LLM pools (7B-70B) demonstrate that GAR achieves substantial carbon reductions while maintaining competitive accuracy and p95 latency guarantees, providing a practical, theoretically grounded approach to sustainable LLM inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Green-Aware Routing (GAR), a constrained multi-objective optimization framework that minimizes per-request CO2 emissions for LLM inference subject to accuracy floors and p95-latency SLOs. It employs lightweight estimators for correctness, tail latency, and carbon emissions to enable real-time routing without additional inference passes. The authors present GAR-PD, an online primal-dual algorithm for rolling carbon budgets, and heuristic variants. Experiments on standard NLP benchmarks with heterogeneous 7B-70B model pools are reported to achieve substantial carbon reductions while maintaining competitive accuracy and p95 latency guarantees.

Significance. If the experimental results and estimator accuracies hold, this work would be significant for sustainable AI and LLM serving. It fills a gap by incorporating variable grid carbon intensity and model energy differences into routing via constrained optimization, offering a practical method beyond cost/latency-focused approaches. The primal-dual formulation and per-dataset tuning provide theoretical grounding for online decisions under carbon budgets.

major comments (2)
  1. [Abstract] The abstract asserts experimental success on NLP benchmarks with 'substantial carbon reductions' and 'competitive accuracy' but supplies no quantitative results, error bars, baseline comparisons, or details on estimator training and constraint-satisfaction rates, leaving the central claim unsupported by visible evidence.
  2. [Framework description] The constrained optimization claims depend on lightweight estimators for p95 latency and carbon emissions achieving low prediction error to avoid SLO violations. No validation is provided on estimator fidelity across 7B-70B model sizes, prompt characteristics, or hardware variations, nor on how per-dataset accuracy-floor tuning maps to online feasibility without hidden gaps.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the abstract and provide more explicit validation details. We address each major comment below and outline the corresponding revisions.

read point-by-point responses
  1. Referee: [Abstract] The abstract asserts experimental success on NLP benchmarks with 'substantial carbon reductions' and 'competitive accuracy' but supplies no quantitative results, error bars, baseline comparisons, or details on estimator training and constraint-satisfaction rates, leaving the central claim unsupported by visible evidence.

    Authors: We agree that the abstract would be improved by including concrete quantitative results to support the claims. In the revised version, we will incorporate specific metrics drawn from the experimental results, such as average per-request CO2 reductions (with ranges across benchmarks), accuracy maintenance relative to baselines, p95 latency SLO satisfaction rates, estimator prediction errors, and constraint feasibility percentages. This will make the central contributions more evident while maintaining the abstract's length and focus. revision: yes

  2. Referee: [Framework description] The constrained optimization claims depend on lightweight estimators for p95 latency and carbon emissions achieving low prediction error to avoid SLO violations. No validation is provided on estimator fidelity across 7B-70B model sizes, prompt characteristics, or hardware variations, nor on how per-dataset accuracy-floor tuning maps to online feasibility without hidden gaps.

    Authors: The manuscript reports estimator performance and experimental outcomes in the evaluation section, including prediction errors for latency and carbon models. To address the request for more explicit validation, we will add a dedicated paragraph in the framework section summarizing estimator fidelity results broken down by model size (7B-70B), prompt characteristics, and constraint satisfaction rates under per-dataset accuracy floor tuning. This will clarify the mapping to online feasibility. We note that hardware variation analysis is limited to the tested setups and can be expanded with a limitations statement if needed. revision: partial

Circularity Check

0 steps flagged

No significant circularity in GAR's optimization framework or claims

full rationale

The paper defines GAR as a constrained multi-objective optimization that minimizes CO2 subject to accuracy floors and p95 SLOs, using standard primal-dual methods and lightweight estimators trained separately for correctness, latency, and emissions. No quoted equation or step shows a prediction reducing to its own fitted inputs by construction, there is no load-bearing self-citation behind the central result, and no ansatz or uniqueness claim is imported from prior author work. Experiments on NLP benchmarks with 7B-70B pools provide external validation of the reductions while meeting constraints, keeping the derivation self-contained against the described inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; full paper likely contains additional fitted estimator parameters and domain assumptions about model heterogeneity.

axioms (1)
  • domain assumption Lightweight estimators can predict per-model accuracy, p95 latency, and carbon emissions with sufficient fidelity for real-time constrained optimization
    Invoked to enable routing without extra inference passes

pith-pipeline@v0.9.0 · 5503 in / 1199 out tokens · 31886 ms · 2026-05-13T01:36:37.436354+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 4 internal anchors

  1. [1]

    Cargo: A Framework for Confidence-Aware Routing of LLM Queries

    Ahmed Barrak, Ahmed Abdelsalam, Karan Jain, et al. Cargo: A framework for confidence-aware routing of LLM queries. arXiv preprint arXiv:2509.14899.

  2. [2]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176.

  3. [3]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

  4. [4]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

  5. [5]

    Measuring the Carbon Intensity of AI in Cloud Instances

    Jesse Dodge, Taylor Prewitt, Remi Tachet des Combes, Erika Odmark, Roy Schwartz, Emma Strubell, Alexandra Sasha Luccioni, Noah A Smith, Nicole DeCario, and Will Buchanan. Measuring the carbon intensity of AI in cloud instances. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 1877–1894.

  6. [6]

    GraphRouter: A Graph-based Router for LLM Selections

    Tao Feng, Yanzhen Shen, and Jiaxuan You. GraphRouter: A graph-based router for LLM selections. arXiv preprint arXiv:2410.03834.

  7. [7]

    RouterBench: A Benchmark for Multi-LLM Routing System

    Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. RouterBench: A benchmark for multi-LLM routing system. arXiv preprint arXiv:2403.12031.

  8. [8]

    Clover: Toward Sustainable AI with Carbon-Aware Machine Learning Inference Service

    Baolin Li, Siddharth Samsi, Vijay Gadepally, and Devesh Tiwari. Clover: Toward sustainable AI with carbon-aware machine learning inference service. arXiv preprint arXiv:2304.09781.

  9. [9]

    Sprout: Green Generative AI with Carbon-Efficient LLM Inference

    Baolin Li, Yankai Jiang, Vijay Gadepally, and Devesh Tiwari. Sprout: Green generative AI with carbon-efficient LLM inference. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 22074–22086.

  10. [10]

    EcoServe: Designing Carbon-Aware AI Inference Systems

    Yueying Li, Zhanqiu Hu, Esha Choukse, Rodrigo Fonseca, G Edward Suh, and Udit Gupta. EcoServe: Designing carbon-aware AI inference systems. arXiv preprint arXiv:2502.05043.

  11. [11]

    RouteLLM: Learning to Route LLMs with Preference Data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs with preference data. arXiv preprint arXiv:2406.18665.

  12. [12]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392.