pith. machine review for the scientific record.

arxiv: 2605.11603 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: no theorem link

GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 01:36 UTC · model grok-4.3

classification 💻 cs.AI
keywords carbon-aware routing · LLM inference · constrained optimization · model selection · energy efficiency · sustainable computing · primal-dual algorithm · service-level objectives

The pith

GAR routes each LLM request to minimize carbon emissions while enforcing accuracy floors and p95 latency bounds across heterogeneous model pools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Green-Aware Routing as a way to choose which model handles a given request from a pool of LLMs ranging from 7B to 70B parameters. It treats carbon emissions as the quantity to minimize and casts accuracy and tail latency as hard constraints that must be met. Lightweight estimators predict those three quantities from request features alone, and an online primal-dual algorithm adjusts the constraints on the fly using per-dataset tuning. If the approach works, inference systems could shift load toward lower-carbon models during high-intensity grid periods without new hardware or extra inference passes.

Core claim

GAR is a constrained multi-objective optimization framework that minimizes per-request CO2 emissions subject to explicit accuracy floors and p95-latency service-level objectives. It employs adaptive constraint optimization through per-dataset floor tuning together with lightweight estimators for correctness, tail latency, and carbon emissions, enabling real-time routing decisions without additional inference passes. GAR-PD provides a practical online primal-dual routing algorithm for rolling carbon budgets, while heuristic variants maintain high feasibility coverage with limited accuracy degradation. Experiments on standard NLP benchmarks demonstrate substantial carbon reductions while maintaining competitive accuracy and p95 latency guarantees.

What carries the argument

The constrained multi-objective optimization that places carbon emissions as the objective and accuracy plus p95 latency as enforceable constraints, solved with per-dataset adaptive tuning and an online primal-dual algorithm.
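The per-request decision this describes can be sketched in a few lines. The following is a hypothetical reconstruction, not the paper's implementation: model names, estimate values, and the graceful-degradation fallback are all illustrative assumptions.

```python
# Hypothetical sketch of GAR-style feasibility routing: among models whose
# estimated accuracy clears the floor and whose estimated latency fits the
# SLO, pick the one with the lowest estimated carbon cost. All names and
# numbers are illustrative, not taken from the paper.
from dataclasses import dataclass

@dataclass
class ModelEstimate:
    name: str
    accuracy: float    # estimated probability of a correct answer
    latency_ms: float  # estimated latency for this request
    carbon_g: float    # estimated grams CO2 for this request

def route(estimates, accuracy_floor, latency_slo_ms):
    """Return the lowest-carbon model satisfying both constraints,
    falling back to the most accurate model if none is feasible."""
    feasible = [m for m in estimates
                if m.accuracy >= accuracy_floor and m.latency_ms <= latency_slo_ms]
    if feasible:
        return min(feasible, key=lambda m: m.carbon_g)
    # infeasible request: degrade gracefully toward best accuracy
    return max(estimates, key=lambda m: m.accuracy)

pool = [
    ModelEstimate("7B",  accuracy=0.78, latency_ms=120, carbon_g=0.4),
    ModelEstimate("13B", accuracy=0.84, latency_ms=210, carbon_g=0.9),
    ModelEstimate("70B", accuracy=0.91, latency_ms=640, carbon_g=4.1),
]
choice = route(pool, accuracy_floor=0.80, latency_slo_ms=500)
print(choice.name)  # "13B": lowest carbon among the feasible models
```

The point of the framing is visible here: carbon is the only quantity ever ranked, while accuracy and latency act purely as filters.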

If this is right

  • Requests can be steered to smaller or more efficient models when grid carbon intensity rises, provided the accuracy floor for that task remains satisfied.
  • Per-dataset tuning lets the same framework adapt accuracy requirements to different benchmarks without manual retuning of every constraint.
  • Rolling carbon budgets become enforceable over time windows rather than single requests.
  • Heuristic approximations offer practical fallbacks that still respect the latency bound when exact optimization is computationally expensive.
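The rolling-budget mechanism behind GAR-PD can be illustrated with a minimal online primal-dual loop. This is a hedged reconstruction in the spirit of the paper's description, not its algorithm: the scoring rule, step size, and per-model estimates are assumptions.

```python
# Minimal online primal-dual sketch (illustrative, not the paper's GAR-PD):
# a dual price on carbon rises when spending runs ahead of a rolling budget,
# steering later requests toward cheaper models. Constants are made up.

def primal_dual_route(requests, models, budget_per_req, step=0.05):
    lam = 0.0          # dual price on carbon overspend
    choices = []
    for est in requests:  # est maps model name -> (accuracy, carbon_g)
        # primal step: maximize estimated accuracy minus priced carbon
        name = max(models, key=lambda m: est[m][0] - lam * est[m][1])
        acc, carbon = est[name]
        choices.append(name)
        # dual step: raise the price if over budget, relax it otherwise
        lam = max(0.0, lam + step * (carbon - budget_per_req))
    return choices

# identical requests; "large" is more accurate but far more carbon-intensive
ests = [{"small": (0.7, 0.5), "large": (0.9, 4.0)}] * 5
print(primal_dual_route(ests, ["small", "large"], budget_per_req=1.0))
# ['large', 'small', 'small', 'small', 'small']
```

After one over-budget choice the dual price makes the large model unattractive, which is exactly the "enforceable over time windows rather than single requests" behavior described above.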

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same constraint-based framing could be applied to other scarce resources such as memory bandwidth or specialized accelerator time.
  • If the estimators generalize, the router could incorporate real-time regional carbon intensity data to prefer models located in cleaner grids.
  • The approach points toward treating sustainability metrics as first-class constraints in any multi-model serving system rather than post-hoc filters.

Load-bearing premise

Lightweight estimators for accuracy, tail latency, and emissions can be trained to stay accurate enough for real-time decisions without extra model runs, and per-dataset floor tuning keeps the constraints feasible across different model sizes.
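This premise is cheap to probe. The sketch below fits a tiny closed-form linear estimator mapping a single request feature (prompt length) to latency; it is a stand-in for the paper's unspecified estimators, using synthetic data.

```python
# Stand-in for a "lightweight estimator": ordinary least squares mapping
# prompt length to latency. The data is synthetic (exactly y = x + 60),
# chosen only to show the fit-then-predict shape of the premise.

def fit_linear(xs, ys):
    """Closed-form OLS for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

tokens  = [50, 100, 200, 400, 800]     # prompt length per training request
latency = [110, 160, 260, 460, 860]    # measured latency in ms
a, b = fit_linear(tokens, latency)
pred = a * 300 + b
print(round(pred))  # 360 ms predicted for a 300-token prompt
```

The open question the premise hides is whether errors of such estimators stay small enough across model sizes and hardware that the feasibility filter does not silently violate the SLO.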

What would settle it

Run the router live on a new dataset with measured carbon intensity traces and check whether measured accuracy drops below the tuned floor or p95 latency exceeds the target on a statistically significant fraction of requests.
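That settling test can be scripted directly from per-request logs: compute the empirical p95 latency and the accuracy rate, then compare both against the tuned targets. The thresholds and trace values below are illustrative.

```python
# Post-hoc feasibility audit over measured per-request logs (synthetic data).

def p95(values):
    """Nearest-rank 95th percentile: smallest value with >= 95% of the
    sample at or below it."""
    s = sorted(values)
    idx = max(0, -(-95 * len(s) // 100) - 1)   # ceil(0.95 * n) - 1
    return s[idx]

latencies = [120, 130, 150, 200, 210, 250, 300, 340, 400, 900]  # ms
correct   = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]   # measured correctness

slo_ms, floor = 500.0, 0.75
print(p95(latencies) <= slo_ms)               # False: the tail violates the SLO
print(sum(correct) / len(correct) >= floor)   # True: accuracy floor holds
```

A run like this on live traces, repeated enough times to bound the violation rate statistically, is what would distinguish a tuned benchmark result from a deployable guarantee.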

Figures

Figures reproduced from arXiv: 2605.11603 by Disha Sheshanarayana, Manjira Sinha, Rajat Subhra Pal, Tirthankar Dasgupta.

Figure 1. Overview of the GAR framework. GAR estimates quality, latency, and carbon emissions. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png]
Figure 2. Baseline performance comparison across model pool and benchmark datasets. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png]
Figure 3. Comprehensive performance analysis of GAR methods. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]
read the original abstract

The growing deployment of large language models (LLMs) makes per-request routing essential for balancing response quality and computational cost across heterogeneous model pools. Current routing methods rarely consider sustainable energy use and CO2 emissions as optimization objectives, despite grid carbon intensity varying by time and region, and models differing significantly in energy consumption. To address this gap, we introduce Green-Aware Routing (GAR), a constrained multi-objective optimization framework that minimizes per-request CO2 emissions subject to explicit accuracy floors and p95-latency service-level objectives (SLOs). GAR employs adaptive constraint optimization through per-dataset floor tuning and incorporates lightweight estimators for correctness, tail latency, and carbon emissions, enabling real-time routing decisions without additional inference passes. We present GAR-PD, a practical online primal-dual routing algorithm for rolling carbon budgets, alongside heuristic variants that achieve high feasibility coverage while limiting accuracy degradation. Comprehensive experiments across standard NLP benchmarks with heterogeneous LLM pools (7B-70B) demonstrate that GAR achieves substantial carbon reductions while maintaining competitive accuracy and p95 latency guarantees, providing a practical, theoretically grounded approach to sustainable LLM inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Green-Aware Routing (GAR), a constrained multi-objective optimization framework that minimizes per-request CO2 emissions for LLM inference subject to accuracy floors and p95-latency SLOs. It employs lightweight estimators for correctness, tail latency, and carbon emissions to enable real-time routing without additional inference passes. The authors present GAR-PD, an online primal-dual algorithm for rolling carbon budgets, and heuristic variants. Experiments on standard NLP benchmarks with heterogeneous 7B-70B model pools are reported to achieve substantial carbon reductions while maintaining competitive accuracy and p95 latency guarantees.

Significance. If the experimental results and estimator accuracies hold, this work would be significant for sustainable AI and LLM serving. It fills a gap by incorporating variable grid carbon intensity and model energy differences into routing via constrained optimization, offering a practical method beyond cost/latency-focused approaches. The primal-dual formulation and per-dataset tuning provide theoretical grounding for online decisions under carbon budgets.

major comments (2)
  1. [Abstract] The abstract asserts experimental success on NLP benchmarks with 'substantial carbon reductions' and 'competitive accuracy' but supplies no quantitative results, error bars, baseline comparisons, or details on estimator training and constraint-satisfaction rates, leaving the central claim unsupported by visible evidence.
  2. [Framework description] The constrained optimization claims depend on lightweight estimators for p95 latency and carbon emissions achieving low prediction error to avoid SLO violations. No validation is provided on estimator fidelity across 7B-70B model sizes, prompt characteristics, or hardware variations, nor on how per-dataset accuracy-floor tuning maps to online feasibility without hidden gaps.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the abstract and provide more explicit validation details. We address each major comment below and outline the corresponding revisions.

read point-by-point responses
  1. Referee: [Abstract] The abstract asserts experimental success on NLP benchmarks with 'substantial carbon reductions' and 'competitive accuracy' but supplies no quantitative results, error bars, baseline comparisons, or details on estimator training and constraint-satisfaction rates, leaving the central claim unsupported by visible evidence.

    Authors: We agree that the abstract would be improved by including concrete quantitative results to support the claims. In the revised version, we will incorporate specific metrics drawn from the experimental results, such as average per-request CO2 reductions (with ranges across benchmarks), accuracy maintenance relative to baselines, p95 latency SLO satisfaction rates, estimator prediction errors, and constraint feasibility percentages. This will make the central contributions more evident while maintaining the abstract's length and focus. revision: yes

  2. Referee: [Framework description] The constrained optimization claims depend on lightweight estimators for p95 latency and carbon emissions achieving low prediction error to avoid SLO violations. No validation is provided on estimator fidelity across 7B-70B model sizes, prompt characteristics, or hardware variations, nor on how per-dataset accuracy-floor tuning maps to online feasibility without hidden gaps.

    Authors: The manuscript reports estimator performance and experimental outcomes in the evaluation section, including prediction errors for latency and carbon models. To address the request for more explicit validation, we will add a dedicated paragraph in the framework section summarizing estimator fidelity results broken down by model size (7B-70B), prompt characteristics, and constraint satisfaction rates under per-dataset accuracy floor tuning. This will clarify the mapping to online feasibility. We note that hardware variation analysis is limited to the tested setups and can be expanded with a limitations statement if needed. revision: partial

Circularity Check

0 steps flagged

No significant circularity in GAR's optimization framework or claims

full rationale

The paper defines GAR as a constrained multi-objective optimization that minimizes CO2 subject to accuracy floors and p95 SLOs, using standard primal-dual methods and lightweight estimators trained separately for correctness, latency, and emissions. No quoted equation or step shows a prediction reducing to its own fitted inputs by construction, there is no load-bearing self-citation behind the central result, and no ansatz or uniqueness claim is imported from prior author work. Experiments on NLP benchmarks with 7B-70B pools provide external validation of the reductions while meeting constraints, keeping the derivation self-contained against the described inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; full paper likely contains additional fitted estimator parameters and domain assumptions about model heterogeneity.

axioms (1)
  • domain assumption Lightweight estimators can predict per-model accuracy, p95 latency, and carbon emissions with sufficient fidelity for real-time constrained optimization
    Invoked to enable routing without extra inference passes

pith-pipeline@v0.9.0 · 5503 in / 1199 out tokens · 31886 ms · 2026-05-13T01:36:37.436354+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 4 internal anchors

  1. [1]

    Cargo: A Framework for Confidence-Aware Routing of LLM Queries

    Ahmed Barrak, Ahmed Abdelsalam, Karan Jain, et al. Cargo: A framework for confidence-aware routing of LLM queries. arXiv preprint arXiv:2509.14899.

  2. [2]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176.

  3. [3]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

  4. [4]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

  5. [5]

    Measuring the Carbon Intensity of AI in Cloud Instances

    Jesse Dodge, Taylor Prewitt, Remi Tachet des Combes, Erika Odmark, Roy Schwartz, Emma Strubell, Alexandra Sasha Luccioni, Noah A Smith, Nicole DeCario, and Will Buchanan. Measuring the carbon intensity of AI in cloud instances. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 1877–1894.

  6. [6]

    GraphRouter: A Graph-based Router for LLM Selections

    Tao Feng, Yanzhen Shen, and Jiaxuan You. GraphRouter: A graph-based router for LLM selections. arXiv preprint arXiv:2410.03834.

  7. [7]

    RouterBench: A Benchmark for Multi-LLM Routing System

    Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. RouterBench: A benchmark for multi-LLM routing system. arXiv preprint arXiv:2403.12031.

  8. [8]

    Clover: Toward Sustainable AI with Carbon-Aware Machine Learning Inference Service

    Baolin Li, Siddharth Samsi, Vijay Gadepally, and Devesh Tiwari. Clover: Toward sustainable AI with carbon-aware machine learning inference service. arXiv preprint arXiv:2304.09781.

  9. [9]

    Sprout: Green Generative AI with Carbon-Efficient LLM Inference

    Baolin Li, Yankai Jiang, Vijay Gadepally, and Devesh Tiwari. Sprout: Green generative AI with carbon-efficient LLM inference. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 22074–22086.

  10. [10]

    EcoServe: Designing Carbon-Aware AI Inference Systems

    Yueying Li, Zhanqiu Hu, Esha Choukse, Rodrigo Fonseca, G Edward Suh, and Udit Gupta. EcoServe: Designing carbon-aware AI inference systems. arXiv preprint arXiv:2502.05043.

  11. [11]

    RouteLLM: Learning to Route LLMs with Preference Data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs with preference data. arXiv preprint arXiv:2406.18665.

  12. [12]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392.