Scalable Joint Resource Allocation for SLO-Constrained LLM Inference in Heterogeneous GPU Clouds
Pith reviewed 2026-05-10 17:57 UTC · model grok-4.3
The pith
Two heuristics allocate mixed-scale LLMs on heterogeneous GPUs in under one second while meeting SLOs and approaching optimal cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Adaptive Greedy Heuristic (AGH) extends a basic greedy pass with multi-start construction, relocate-based local search, and GPU consolidation. Combined with three constraint-aware mechanisms (tensor-parallelism (TP)-aware feasibility selection, cost-per-effective-coverage ranking, and TP upgrade), it yields feasible allocations in under one second whose cost closely matches the MILP optimum. Under 1.5x parameter inflation it keeps SLO violations controlled, while the exact solver's solution degrades sharply.
What carries the argument
The Adaptive Greedy Heuristic (AGH) with multi-start construction, relocate-based local search, GPU consolidation, and the three constraint-aware mechanisms of TP-aware feasibility selection, cost-per-effective-coverage ranking, and TP upgrade.
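As a rough illustration of two of the named mechanisms, the sketch below filters GPU groups by a TP-aware memory check and then ranks them by cost per effective coverage. It is a minimal sketch under stated assumptions, not the paper's algorithm: the `Config` fields, the even-sharding memory rule, and the definition of effective coverage as `min(throughput, remaining demand)` are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Config:
    """One deployable GPU group (all fields are hypothetical)."""
    gpu: str               # GPU type label
    tp: int                # tensor-parallel degree (GPUs in the group)
    mem_per_gpu_gb: float  # usable memory per GPU
    cost_per_hour: float   # price of the whole group
    throughput: float      # requests/s the group serves within the SLO


def tp_feasible(cfg: Config, model_mem_gb: float) -> bool:
    # Assumption: model weights shard evenly across the TP group.
    return cfg.throughput > 0 and model_mem_gb / cfg.tp <= cfg.mem_per_gpu_gb


def greedy_cover(configs: List[Config], model_mem_gb: float,
                 demand: float) -> Optional[List[Config]]:
    """Repeatedly add the group with the lowest cost per unit of demand
    it can still usefully cover; return None if nothing fits in memory."""
    feasible = [c for c in configs if tp_feasible(c, model_mem_gb)]
    if not feasible:
        return None
    chosen: List[Config] = []
    remaining = demand
    while remaining > 1e-9:
        best = min(
            feasible,
            key=lambda c: c.cost_per_hour / min(c.throughput, remaining),
        )
        chosen.append(best)
        remaining -= best.throughput
    return chosen


small_tp1 = Config("small", 1, 24.0, 1.0, 10.0)
small_tp2 = Config("small", 2, 24.0, 2.0, 25.0)
big_tp1 = Config("big", 1, 80.0, 4.0, 40.0)
plan = greedy_cover([small_tp1, small_tp2, big_tp1],
                    model_mem_gb=40.0, demand=30.0)
```

In this toy instance the 40 GB model does not fit on a single small GPU, so `small_tp1` is filtered out, and the ranking picks two `small_tp2` groups. Capping coverage at the remaining demand keeps a large, expensive group from looking cheap when most of its capacity would sit idle.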
If this is right
- Real-time reallocation becomes feasible for continuously arriving inference requests.
- Operational cost stays near the theoretical minimum without long computation delays.
- Deployments remain stable when model parameter counts shift after initial planning.
- Exact solvers can be reserved for small problems while heuristics handle production scale.
Where Pith is reading between the lines
- The same structure could support online scheduling in cloud systems that adjust GPU pools every few minutes.
- Similar local-search refinements might transfer to other heterogeneous hardware allocation settings such as mixed CPU-GPU training clusters.
- Pre-computing a small library of good starting allocations could cut the multi-start overhead even further.
Load-bearing premise
The three constraint-aware mechanisms can always produce feasible allocations when memory, delay, error, and budget constraints are tightly coupled.
What would settle it
A workload instance where the heuristics output an allocation that violates SLOs or budget while the exact MILP solver finds a feasible lower-cost solution, or where either heuristic takes more than one second on the paper's large-scale instances.
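The settling condition can be written as a small predicate over one heuristic run and one MILP run on the same instance. This is only a sketch of the test as worded above; the `RunResult` record and its field names are hypothetical, and the one-second limit is taken from the claim being tested.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one solver run (hypothetical record for illustration)."""
    feasible: bool    # meets every SLO and the budget
    cost: float       # operational cost of the returned allocation
    runtime_s: float  # wall-clock solve time in seconds


def refutes_claim(heuristic: RunResult, milp: RunResult,
                  time_limit_s: float = 1.0) -> bool:
    """True if this instance would count against the core claim."""
    # Case 1: the heuristic breaks SLOs or budget while the exact
    # solver finds a feasible, cheaper allocation.
    missed_feasible = (
        not heuristic.feasible
        and milp.feasible
        and milp.cost < heuristic.cost
    )
    # Case 2: the heuristic exceeds the claimed runtime.
    too_slow = heuristic.runtime_s > time_limit_s
    return missed_feasible or too_slow
```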
Original abstract
Serving large language model (LLM) inference in cloud environments requires jointly optimizing model selection, GPU provisioning, parallelism configuration, and workload routing under latency, accuracy, memory, and budget constraints. While mixed-integer linear programming (MILP) can model this problem, its computational cost limits frequent re-optimization under demand variability. Existing heuristics often optimize individual components separately and may become infeasible when system-wide constraints are enforced. This paper presents a scalable framework for SLO-constrained LLM inference. We formulate the problem as an MILP with a two-phase delay model capturing both prefill and autoregressive decoding under tensor and pipeline parallelism. To solve it efficiently, we develop two constraint-aware heuristics: a Greedy Heuristic (GH) and an Adaptive Greedy Heuristic (AGH). AGH extends GH through multi-start construction, local search, and GPU consolidation. Both methods maintain feasibility through parallelism-aware filtering, cost-based ranking, and adaptive parallelism scaling. Experiments based on the Azure LLM Inference Trace show that GH generates feasible solutions within one second, while AGH achieves near-optimal performance within three seconds and scales to large instances where exact solvers fail to converge. Under out-of-sample stress with up to 1.5x delay and accuracy inflation, AGH degrades gracefully through provisioned headroom, yielding substantially lower cost and SLO violations than cost-minimal MILP solutions. Across synthetic and real Azure workloads, AGH maintains SLO compliance at significantly lower cost than exact MILP solutions. These results demonstrate that high-quality allocations provide substantial robustness to demand variability while enabling rapid adaptation to workload changes.
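The abstract does not spell out the two-phase delay model, so the following is only a generic roofline-style approximation of what such a model might look like, not the paper's formulation. It assumes prefill is compute-bound, decoding is memory-bandwidth-bound, tensor parallelism divides both evenly, and pipeline parallelism adds a fixed per-hop latency; every parameter name is hypothetical.

```python
def prefill_seconds(params: float, prompt_tokens: int, tp: int, pp: int,
                    flops_per_gpu: float, hop_s: float) -> float:
    # Assumption: about 2 FLOPs per parameter per prompt token,
    # split evenly across the TP group.
    compute = 2.0 * params * prompt_tokens / (tp * flops_per_gpu)
    # Assumption: each pipeline stage boundary adds one hop of latency.
    return compute + (pp - 1) * hop_s


def decode_seconds(params: float, out_tokens: int, tp: int, pp: int,
                   bytes_per_param: float, bw_per_gpu: float,
                   hop_s: float) -> float:
    # Assumption: each generated token streams all weights from memory
    # once, with the read split evenly across the TP group.
    per_token = params * bytes_per_param / (tp * bw_per_gpu)
    per_token += (pp - 1) * hop_s
    return out_tokens * per_token


def request_delay(params: float, prompt_tokens: int, out_tokens: int,
                  tp: int, pp: int, flops_per_gpu: float,
                  bytes_per_param: float, bw_per_gpu: float,
                  hop_s: float) -> float:
    """End-to-end delay as the sum of the prefill and decode phases."""
    return (
        prefill_seconds(params, prompt_tokens, tp, pp, flops_per_gpu, hop_s)
        + decode_seconds(params, out_tokens, tp, pp, bytes_per_param,
                         bw_per_gpu, hop_s)
    )


# Round illustrative numbers, not measurements of any real GPU.
delay = request_delay(params=1e9, prompt_tokens=1000, out_tokens=100,
                      tp=2, pp=1, flops_per_gpu=1e12,
                      bytes_per_param=2.0, bw_per_gpu=1e12, hop_s=1e-3)
```

With these round numbers the prefill phase takes 1.0 s and the decode phase 0.1 s. Keeping the two phases separate matters because raising the TP degree shortens both, while adding pipeline stages in this simplified model only adds hop latency to a single request.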
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes two constraint-aware heuristics—a single-pass Greedy Heuristic (GH) and an Adaptive Greedy Heuristic (AGH) that adds multi-start construction, relocate-based local search, and GPU consolidation—for jointly selecting base models, heterogeneous GPUs, parallelism degrees, and workload distributions under coupled memory, latency, accuracy, and budget constraints. On workloads derived from the Azure LLM Inference Trace (2025), both heuristics return feasible solutions in under one second; AGH approaches the cost of an exact MILP solver while delivering >260× speedup on large instances. Under out-of-sample stress tests with up to 1.5× parameter inflation, AGH keeps SLO violations controlled and cost stable, whereas the exact solver degrades.
Significance. If the reported speedups and robustness hold, the work would be significant for practical LLM serving systems: it shows that carefully designed, constraint-aware heuristics can make mixed-integer allocation tractable at scale without sacrificing feasibility or solution quality. The explicit use of a public trace for calibration plus controlled out-of-sample inflation provides a reproducible empirical foundation that is stronger than purely synthetic evaluations.
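To make the out-of-sample stress test concrete, the sketch below inflates planned per-workload delays by a factor and counts how many exceed the SLO. It is a toy illustration of why provisioned headroom helps, with made-up delay values; it does not reproduce the paper's evaluation protocol.

```python
from typing import Sequence


def violation_rate(planned_delays: Sequence[float], slo: float,
                   inflation: float) -> float:
    """Fraction of workloads whose inflated delay exceeds the SLO."""
    inflated = [d * inflation for d in planned_delays]
    return sum(d > slo for d in inflated) / len(inflated)


slo_s = 1.0
# Hypothetical plans: one sits at the SLO edge, one keeps headroom.
tight_plan = [0.9, 1.0, 0.95, 1.0]
headroom_plan = [0.5, 0.6, 0.55, 0.6]

tight_rate = violation_rate(tight_plan, slo_s, inflation=1.5)
headroom_rate = violation_rate(headroom_plan, slo_s, inflation=1.5)
```

Any plan whose delays stay at or below `slo / 1.5` survives 1.5x inflation without a violation, which is the intuition behind a cost-minimal allocation degrading faster than one with slack.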
minor comments (3)
- [Abstract] The performance claims (sub-second runtimes, 260× speedup, controlled violations) are stated without any reference to the methods section or to the three named mechanisms (TP-aware feasibility selection, cost-per-effective-coverage ranking, TP upgrade); a single sentence summarizing how these mechanisms enforce feasibility would make the abstract self-contained.
- [Evaluation] While the text describes success on the Azure trace and 1.5× inflation tests, no summary table or figure reports the exact cost ratios, violation counts, or runtime distributions across instance sizes; adding such a table would allow readers to assess the “closely approaching optimal” claim quantitatively.
- [Methods] The manuscript would benefit from a short pseudocode listing or algorithmic sketch of AGH (multi-start + relocate + consolidation) to complement the prose description of the three constraint-aware mechanisms.
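The kind of algorithmic sketch requested in the last comment might look like the following. It is a deliberately simplified stand-in, not the authors' AGH: the problem is reduced to packing workload loads onto identical GPUs of capacity `cap`, the constructive step is plain first-fit, and all function names are hypothetical. It assumes every individual load is at most `cap`.

```python
import random
from typing import List


def bin_load(loads: List[float], b: List[int]) -> float:
    return sum(loads[j] for j in b)


def first_fit(loads: List[float], order: List[int],
              cap: float) -> List[List[int]]:
    """Constructive pass: place each workload on the first GPU with room."""
    bins: List[List[int]] = []
    for i in order:
        for b in bins:
            if bin_load(loads, b) + loads[i] <= cap:
                b.append(i)
                break
        else:
            bins.append([i])
    return bins


def relocate_and_consolidate(loads: List[float], bins: List[List[int]],
                             cap: float) -> List[List[int]]:
    """Try to empty the least-loaded GPU by relocating its workloads;
    if every workload finds a new home, that GPU is released."""
    improved = True
    while improved and len(bins) > 1:
        improved = False
        bins = sorted(bins, key=lambda b: bin_load(loads, b))
        donor, others = bins[0], [list(b) for b in bins[1:]]
        for i in sorted(donor, key=lambda j: -loads[j]):
            target = next(
                (b for b in others
                 if bin_load(loads, b) + loads[i] <= cap), None)
            if target is None:
                break  # relocation failed, keep the current layout
            target.append(i)
        else:
            bins, improved = others, True
    return bins


def agh_sketch(loads: List[float], cap: float, starts: int = 8,
               seed: int = 0) -> List[List[int]]:
    """Multi-start: one largest-first start, then seeded random orders."""
    rng = random.Random(seed)
    n = len(loads)
    best = None
    for s in range(starts):
        if s == 0:
            order = sorted(range(n), key=lambda j: -loads[j])
        else:
            order = rng.sample(range(n), n)
        bins = relocate_and_consolidate(
            loads, first_fit(loads, order, cap), cap)
        if best is None or len(bins) < len(best):
            best = bins
    return best


allocation = agh_sketch([6.0, 5.0, 4.0, 3.0, 2.0], cap=10.0)
```

For these five loads, which total 20 against a capacity of 10, the sketch returns two GPUs, matching the lower bound. A real version would replace the GPU count with heterogeneous cost and add the memory, delay, error, and budget checks to every move.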
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our work on the GH and AGH heuristics for mixed-scale LLM allocation under SLO constraints, as well as for highlighting the significance of the speedups, feasibility guarantees, and reproducible evaluation on the Azure trace with out-of-sample stress tests. The recommendation for minor revision is noted; we will incorporate any editorial or minor clarifications in the revised version.
Circularity Check
No significant circularity
The paper presents two new constraint-aware heuristics (GH and AGH) for mixed-scale LLM allocation under SLO constraints and evaluates them empirically against an exact MILP solver on the Azure LLM Inference Trace (2025) plus out-of-sample stress tests. The reported results (feasibility in <1s, 260x speedup, cost proximity, controlled SLO violations) are direct measurements of algorithm runtime and solution quality on external trace data; no equations reduce these outcomes to fitted parameters defined by the same data, no self-citations bear the central claim, and no ansatz or uniqueness theorem is invoked to force the result. The derivation chain consists of algorithmic construction plus benchmark comparison and is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the Azure LLM Inference Trace (2025) provides representative workloads for evaluating allocation heuristics under realistic request patterns.