
arxiv: 2606.07248 · v1 · pith:7VRENOHDnew · submitted 2026-06-05 · 💻 cs.DC

Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends

Pith reviewed 2026-06-27 20:57 UTC · model grok-4.3

classification 💻 cs.DC
keywords LLM inference · scheduling · head-of-line blocking · response length prediction · shortest job first · serial backends · edge deployment · XGBoost

The pith

Clairvoyant predicts LLM response lengths from 19 lexical features to enable shortest-job-first admission ordering in serial backends.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Serial LLM inference servers process one request at a time under first-come-first-served rules, so at high load a single long generation job can delay many short queries by minutes. Clairvoyant adds a lightweight proxy that estimates output length from text features alone and reorders the admission queue to run shorter jobs first. The proxy uses an XGBoost model that runs in 0.029 ms per request and reaches 62-96% in-distribution ranking accuracy on natural conversation data. End-to-end tests on an RTX 4090 report 70-76% lower median (P50) latency for short requests with 100 concurrent requests queued, and 17% lower under steady Poisson traffic. The design requires no changes to the backend and works with any serial OpenAI-compatible server.
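To make the head-of-line blocking effect concrete, here is a minimal sketch of a serial server that runs one job at a time. The workload (one 120 s generation job queued just ahead of five 2 s queries, all present at t = 0) is a hypothetical illustration, not data from the paper, and the sketch assumes job lengths are known exactly.

```python
from statistics import median

def completion_times(service_times):
    """Completion time of each job when jobs run one at a time in the given order."""
    clock = 0.0
    done = []
    for s in service_times:
        clock += s
        done.append(clock)
    return done

# Hypothetical burst: one long job arrives first, followed by five short queries.
jobs = [120.0, 2.0, 2.0, 2.0, 2.0, 2.0]
sjf_order = sorted(jobs)                 # shortest job first

fcfs = completion_times(jobs)            # first come, first served
sjf = completion_times(sjf_order)

# Latency of the short queries only (threshold of 10 s is illustrative).
fcfs_short = [t for t, s in zip(fcfs, jobs) if s < 10]
sjf_short = [t for t, s in zip(sjf, sjf_order) if s < 10]

print(median(fcfs_short))  # 126.0
print(median(sjf_short))   # 6.0
```

Total work is identical under both orderings (the last job finishes at 130 s either way); only who waits changes. A real deployment never knows the lengths in advance, which is why a predictor is needed.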

Core claim

The paper claims that an ONNX-exported XGBoost classifier trained on 19 lexical features can predict relative response lengths with 62-96% in-distribution ranking fidelity, which is sufficient to implement predictive shortest-job-first admission control. This ordering mitigates head-of-line blocking in serial LLM backends without the high VRAM cost of continuous batching, producing measured end-to-end P50 latency reductions of 70-76% for short requests under maximum queue pressure of 100 concurrent requests and 17% under steady-state Poisson arrivals at load factor 0.74.
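This page does not spell out how the paper computes ranking fidelity. One common way to score relative ordering is pairwise ranking accuracy: the fraction of request pairs whose predicted order matches their true length order. The sketch below is that generic metric under stated assumptions (ties in score count as half-correct, pairs with equal true length are skipped); the scores and token counts are made up for illustration.

```python
from itertools import combinations

def pairwise_ranking_accuracy(scores, lengths):
    """Fraction of request pairs with different true lengths that the scores order correctly."""
    correct = 0.0
    total = 0
    for i, j in combinations(range(len(lengths)), 2):
        if lengths[i] == lengths[j]:
            continue                      # no defined order for this pair
        total += 1
        true_order = lengths[i] < lengths[j]
        if scores[i] == scores[j]:
            correct += 0.5                # tie in prediction: half credit
        elif (scores[i] < scores[j]) == true_order:
            correct += 1.0
    return correct / total if total else float("nan")

# Hypothetical predicted P(Long) scores and observed output token counts.
scores = [0.10, 0.80, 0.30, 0.60]
lengths = [40, 900, 350, 120]
print(pairwise_ranking_accuracy(scores, lengths))  # 5 of 6 pairs ordered correctly
```

A metric of this kind rewards getting the order right even when the absolute length estimate is poor, which matches the claim that admission scheduling only needs relative ordering.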

What carries the argument

The ONNX-exported XGBoost classifier on 19 lightweight lexical features, which supplies fast per-request length-class predictions used only for relative admission ordering.
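The 19 features themselves are not enumerated on this page. To show what "lightweight lexical features" can look like, the sketch below extracts a handful of model-free signals from a prompt. Every feature name and cue word here is a hypothetical stand-in, not the paper's feature set.

```python
import re

# Hypothetical cue words; the paper's actual features are not listed on this page.
LONG_CUES = ("write", "explain", "essay", "story", "implement", "detailed")
SHORT_CUES = ("what is", "who is", "when", "yes or no", "define")

def lexical_features(prompt: str) -> dict:
    """Cheap features computed from the prompt text alone (illustrative subset)."""
    text = prompt.lower()
    words = re.findall(r"\w+", text)
    return {
        "n_chars": len(prompt),
        "n_words": len(words),
        "n_question_marks": prompt.count("?"),
        "n_newlines": prompt.count("\n"),
        "n_long_cues": sum(text.count(c) for c in LONG_CUES),
        "n_short_cues": sum(text.count(c) for c in SHORT_CUES),
    }

print(lexical_features("What is the capital of France?"))
print(lexical_features("Write a detailed essay explaining TCP congestion control."))
```

Features like these need only string operations, which is consistent with the sub-millisecond per-request budget the paper reports, although that figure covers the paper's own pipeline rather than this sketch.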

If this is right

  • Short requests experience 70-76% lower P50 latency under 100 concurrent requests.
  • Short requests experience 17% lower P50 latency under steady-state Poisson arrivals at ρ=0.74.
  • The proxy adds 0.029 ms overhead per request and needs no backend changes.
  • Natural conversation logs, not curated instruction datasets, are required for effective training because instruction data under-represents long responses.
  • The system applies to any serial OpenAI-compatible backend such as Ollama or llama.cpp.
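For the steady-state case, the load factor of a single serial server is ρ = λ · E[S], where λ is the arrival rate and E[S] the mean service time. A minimal sketch of generating Poisson arrivals at a target ρ follows; the 5 s mean service time and the function name are assumptions for illustration, not values from the paper.

```python
import random

def poisson_arrivals(n, rho, mean_service_s, seed=0):
    """Arrival timestamps of a Poisson process with rate lambda = rho / E[S]."""
    rng = random.Random(seed)
    rate = rho / mean_service_s          # requests per second
    t = 0.0
    times = []
    for _ in range(n):
        t += rng.expovariate(rate)       # exponential inter-arrival gap
        times.append(t)
    return times

# 5 s mean service time is a hypothetical value chosen for the example.
arrivals = poisson_arrivals(n=10_000, rho=0.74, mean_service_s=5.0)
mean_gap = arrivals[-1] / len(arrivals)
print(round(5.0 / mean_gap, 2))          # empirical load factor, close to 0.74
```

At ρ = 0.74 the server is idle roughly a quarter of the time, so queues are often short and reordering has less to work with than in a 100-request burst. That is consistent with the smaller 17% gain reported for this regime.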

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Edge and local deployments with tight VRAM can now mix short and long requests without hardware upgrades.
  • Ranking-focused prediction may allow similar lightweight proxies for other ordering policies such as priority or deadline scheduling.
  • Periodic retraining on fresh conversation logs will likely be needed when workload domains shift.

Load-bearing premise

The reported 62-96% ranking fidelity of the lexical XGBoost model is high enough to produce the claimed end-to-end latency reductions when used to order admissions.
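One way to probe this premise is a toy simulation in which a binary Short/Long predictor is right with a chosen probability and a burst of queued jobs is ordered by its output. The sketch below is not the authors' analysis. The 30% long-job share, the 2 s and 60 s service times, and the symmetric label-flip noise model are all assumptions made for illustration.

```python
import random
from statistics import median

def short_p50(acc, use_sjf=True, n=100, frac_long=0.3,
              short_s=2.0, long_s=60.0, seed=0):
    """Median completion time of truly short jobs in a burst of n queued jobs.

    With use_sjf=True, jobs predicted Short run first; the predictor is
    correct with probability acc. With use_sjf=False, jobs run in arrival order.
    """
    rng = random.Random(seed)
    is_long = [rng.random() < frac_long for _ in range(n)]
    # Flip the true label with probability 1 - acc to emulate prediction error.
    pred_long = [l if rng.random() < acc else (not l) for l in is_long]
    if use_sjf:
        # Stable sort: predicted-Short first, arrival order kept within each class.
        order = sorted(range(n), key=lambda i: pred_long[i])
    else:
        order = list(range(n))
    clock, latencies = 0.0, []
    for i in order:
        clock += long_s if is_long[i] else short_s
        if not is_long[i]:
            latencies.append(clock)
    return median(latencies)

baseline = short_p50(acc=1.0, use_sjf=False)   # FCFS on the same workload
for acc in (0.52, 0.66, 0.80, 0.96, 1.00):
    p50 = short_p50(acc)
    print(f"acc={acc:.2f}  short P50={p50:7.1f} s  "
          f"reduction vs FCFS={100 * (1 - p50 / baseline):5.1f}%")
```

The printed numbers depend entirely on the assumed workload, so they say nothing about the paper's 70-76% figure. The point is the shape of the curve: how quickly the short-request benefit erodes as accuracy falls toward the cross-distribution range.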

What would settle it

Deploy the proxy on a serial backend with a recorded mixed workload, measure actual response lengths and queue latencies, and check whether short-request P50 latency falls by 70-76% under the same 100-request queue pressure.

Figures

Figures reproduced from arXiv: 2606.07248 by Aravind Sundaresan.

Figure 1: Illustrative timeline of HOLB under FCFS vs.

Figure 2: Clairvoyant intercepts API requests, predicts output length in 0.029 ms via lexical feature extraction and ONNX inference, and dispatches to the backend in SJF order. The response path is a transparent pass-through. The starvation guard promotes any request waiting longer than τ = 3 × µshort regardless of predicted P(Long). Evaluated on Apple M1 (Ollama, Gemma3:4b) and RTX 4090 (Ollama, Gemma3:4b, Llama3.1…

Figure 3: SJF latency reduction for short requests vs. queue utilisation
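The Figure 2 caption describes SJF dispatch with a starvation guard at τ = 3 × µshort. A minimal sketch of such an admission queue follows. It assumes µshort is the mean service time of short requests and that promoted requests are served oldest first; neither detail is stated on this page, and the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Pending:
    request_id: str
    p_long: float        # predicted probability that the response is Long
    enqueued_at: float   # seconds

class SJFQueue:
    """Dispatch lowest predicted P(Long) first, unless a request has starved."""

    def __init__(self, mu_short_s: float):
        self.tau = 3.0 * mu_short_s      # starvation threshold from the caption
        self._items = []

    def push(self, item: Pending) -> None:
        self._items.append(item)

    def pop(self, now: float) -> Pending:
        starved = [p for p in self._items if now - p.enqueued_at > self.tau]
        if starved:
            # Assumption: among starved requests, the oldest goes first.
            chosen = min(starved, key=lambda p: p.enqueued_at)
        else:
            chosen = min(self._items, key=lambda p: (p.p_long, p.enqueued_at))
        self._items.remove(chosen)
        return chosen

q = SJFQueue(mu_short_s=2.0)             # tau = 6 s; the 2 s value is hypothetical
q.push(Pending("long-essay", 0.95, enqueued_at=0.0))
q.push(Pending("quick-fact", 0.05, enqueued_at=1.0))
print(q.pop(now=2.0).request_id)         # quick-fact: nothing has waited past tau
q.push(Pending("another-fact", 0.02, enqueued_at=7.0))
print(q.pop(now=8.0).request_id)         # long-essay: waited 8 s > tau, promoted
```

The linear scan in `pop` is adequate for queues of around 100 requests; a heap would reduce the cost for larger queues but makes the time-dependent promotion rule more awkward to express.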
Original abstract

Serial LLM inference backends, such as Ollama, process requests one at a time under FCFS admission, causing Head-of-Line Blocking (HOLB) under mixed workloads at high utilisation: short factual queries can be delayed by minutes behind long generation jobs. While cloud-scale deployments mitigate HOLB via continuous batching (vLLM, Orca), these solutions require tens of GB of VRAM for concurrent KV-caches, which is infeasible for memory-constrained edge and local deployments that rely on serial request dispatch. We present Clairvoyant, a drop-in sidecar proxy for any serial OpenAI-compatible backend (e.g., Ollama, llama.cpp). Clairvoyant predicts response length from 19 lightweight lexical features via an ONNX-exported XGBoost classifier, achieving 0.029 ms per-request latency (four orders of magnitude below typical generation time). Because admission scheduling depends on relative ordering rather than exact prediction, the system optimises ranking fidelity, achieving 62-96% in-distribution and 52-66% cross-distribution accuracy across natural conversation datasets. We find that curated instruction datasets are degenerate training sources for length prediction: GPT-imposed brevity constraints reduce Long-class representation to under 0.02% of examples, making natural conversation logs the only viable training source. End-to-end GPU benchmarks on an RTX 4090 show 70-76% P50 latency reduction for short requests under maximum queue pressure (100 concurrent requests) and 17% under steady-state Poisson arrivals (ρ=0.74). Clairvoyant is open-source and requires no modifications to the inference backend.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Clairvoyant, a drop-in sidecar proxy for serial OpenAI-compatible LLM backends (e.g., Ollama) that uses an ONNX-exported XGBoost classifier on 19 lightweight lexical features to predict response lengths and enable SJF-style admission ordering. This aims to mitigate head-of-line blocking for short requests under mixed workloads. Key claims include 0.029 ms per-request prediction latency, 62-96% in-distribution and 52-66% cross-distribution ranking accuracy, and end-to-end GPU benchmark results on an RTX 4090 showing 70-76% P50 latency reduction for short requests at maximum queue pressure (100 concurrent) and 17% under steady-state Poisson arrivals (ρ=0.74). Natural conversation logs are identified as the only viable training source due to degeneracy in curated instruction datasets.

Significance. If the empirical results hold under scrutiny, this work provides a practical, low-overhead approach to improving responsiveness in memory-constrained local/edge LLM deployments where continuous batching is infeasible. The emphasis on ranking fidelity (rather than exact length prediction) and the open-source release are notable strengths. The identification of training data issues with instruction datasets is a useful observation for the community.

major comments (2)
  1. [Abstract and §5 (benchmarks)] The central end-to-end latency claims (70-76% P50 reduction under queue pressure and 17% under Poisson) rest on benchmarks whose experimental setup, number of runs, statistical significance, error bars, and data exclusion rules are not detailed in the reported results. This directly affects verifiability of the claim that the reported ranking fidelity produces the observed gains.
  2. [§4 (model evaluation) and §5 (end-to-end)] No ablation or sensitivity analysis is provided to demonstrate that the 62-96% in-distribution ranking fidelity is sufficient to deliver the claimed latency reductions when used for admission ordering, nor how performance degrades with lower fidelity (e.g., the 52-66% cross-distribution case).
minor comments (1)
  1. [Abstract] The abstract states concrete numerical results but does not reference the specific tables or figures containing the underlying data; cross-references would improve traceability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve verifiability and analysis of the results.

  1. Referee: [Abstract and §5 (benchmarks)] The central end-to-end latency claims (70-76% P50 reduction under queue pressure and 17% under Poisson) rest on benchmarks whose experimental setup, number of runs, statistical significance, error bars, and data exclusion rules are not detailed in the reported results. This directly affects verifiability of the claim that the reported ranking fidelity produces the observed gains.

    Authors: We agree that the experimental details require expansion for verifiability. The revised §5 will explicitly state the number of independent runs (with random seeds), the method for assessing statistical significance (e.g., confidence intervals or paired tests), inclusion of error bars on all latency figures, and confirmation that no data were excluded beyond standard queue-completion filtering. These additions will directly support the latency claims.
    revision: yes

  2. Referee: [§4 (model evaluation) and §5 (end-to-end)] No ablation or sensitivity analysis is provided to demonstrate that the 62-96% in-distribution ranking fidelity is sufficient to deliver the claimed latency reductions when used for admission ordering, nor how performance degrades with lower fidelity (e.g., the 52-66% cross-distribution case).

    Authors: We acknowledge the absence of a dedicated sensitivity analysis. While the reported cross-distribution results provide partial evidence of behavior under lower fidelity, we will add a new analysis in §5 that simulates admission ordering with controlled prediction noise at fidelity levels matching the observed range (52-96%). This will quantify how latency reductions vary with ranking accuracy and confirm sufficiency of the achieved fidelity.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper presents an empirical system: an XGBoost classifier trained on 19 lexical features to predict response length for SJF-style admission ordering, with measured ranking fidelity on held-out data and direct end-to-end latency benchmarks on RTX 4090 hardware under controlled workloads. No equations or derivations are claimed; the central results (latency reductions, prediction latency, fidelity ranges) are obtained from independent measurements rather than reducing to the training fit by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The argument that relative ordering suffices is supported by explicit cross-distribution accuracy numbers and is not tautological with the model training.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on a trained XGBoost classifier whose parameters are fitted to conversation data and on the domain assumption that lexical features yield usable ranking for SJF ordering; no new physical entities or ungrounded mathematical axioms are introduced.

free parameters (2)
  • XGBoost model parameters
    Fitted during training on natural conversation logs to classify response length categories.
  • 19 lexical feature definitions and selection
    Chosen and engineered to support the length prediction task.
axioms (1)
  • domain assumption: Lexical features extracted from prompts correlate sufficiently with eventual response length to support useful ranking for scheduling.
    Invoked when the paper reports 62-96% in-distribution and 52-66% cross-distribution accuracy and links it to latency gains.

pith-pipeline@v0.9.1-grok · 5829 in / 1558 out tokens · 29222 ms · 2026-06-27T20:57:27.777089+00:00 · methodology

