arxiv: 2604.17111 · v1 · submitted 2026-04-18 · 💻 cs.DC · cs.AI

Recognition: unknown

HiveMind: OS-Inspired Scheduling for Concurrent LLM Agent Workloads

Justice Owusu Agyemang , Jerry John Kponyo , Obed Kwasi Somuah , Elliot Amponsah , Godfred Manu Addo Boakye , Kwame Opuni-Boachie Obour Agyekum

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:08 UTC · model grok-4.3

classification 💻 cs.DC cs.AI

keywords LLM agentsconcurrent workloadsHTTP proxyscheduling primitivesrate limitingAPI contentionbackpressureagent reliability

0 comments

The pith

An HTTP proxy applies operating system scheduling to let multiple LLM agents share rate-limited APIs without crashing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When several LLM coding agents run at once against one API, they compete like unscheduled computer processes and often fail with connection resets even when the API has spare capacity overall. HIVEMIND is a transparent proxy that inserts five scheduling rules from operating systems between the agents and the API: it controls admissions, tracks rate limits, applies backpressure, manages token budgets, and sets priority queues. No agent code needs changes, and the proxy auto-detects providers such as Anthropic or OpenAI. Tests across seven workloads with five to fifty agents show uncoordinated runs fail 72 to 100 percent of the time while the proxy drops failures to zero to 18 percent and removes 48 to 100 percent of wasted compute. The primitives work best together, though transparent retries alone account for most of the gain, and added latency stays under three milliseconds.

Core claim

HIVEMIND is a transparent HTTP proxy that coordinates concurrent LLM agents by applying five operating-system scheduling primitives to shared API endpoints. The primitives are admission control, rate-limit tracking, AIMD backpressure with circuit breaking, token budget management, and priority queuing. In seven scenarios the proxy reduces failure rates from 72-100 percent to 0-18 percent and eliminates 48-100 percent of wasted compute while requiring zero changes to existing agent code and adding under three milliseconds of overhead per request.

What carries the argument

The HIVEMIND transparent HTTP proxy that enforces five OS-inspired scheduling primitives on requests to rate-limited LLM APIs.

If this is right

Concurrent LLM agents can run at higher parallelism on shared APIs without each agent implementing its own coordination logic.
Most failed requests and the compute spent on them are eliminated when the five primitives operate together.
The proxy works for Anthropic, OpenAI, and local models through auto-detected profiles and requires no code changes.
Transparent retry is the single most important primitive, but the full set outperforms any subset.
Real deployments see negligible added latency and can therefore adopt the approach immediately.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same proxy pattern could manage other contended resources such as local GPU queues for open-weight models.
Developers may now attempt larger multi-agent systems that were previously abandoned because of unpredictable API collisions.
The low-overhead design suggests the approach can be inserted into standard agent orchestration tools without redesign.

Load-bearing premise

The seven evaluation scenarios with five to fifty agents capture the contention patterns that appear in diverse real-world LLM agent workloads and API behaviors.

What would settle it

A new workload of thirty concurrent agents on an API with different rate-limit dynamics that still produces failure rates above twenty percent when HIVEMIND is inserted would show the primitives do not generalize.

Figures

Figures reproduced from arXiv: 2604.17111 by Elliot Amponsah, Godfred Manu Addo Boakye, Jerry John Kponyo, Justice Owusu Agyemang, Kwame Opuni-Boachie Obour Agyekum, Obed Kwasi Somuah.

**Figure 1.** Figure 1: Architecture of HiveMind. Agents connect to the local proxy; requests pass through five scheduling layers before reaching the upstream API. The proxy is transparent: agents require zero code changes. and the upstream LLM API provider (Figure 1). Agents make normal API calls to http://localhost:8765/v1/messages; HiveMind applies all scheduling logic before forwarding to the upstream provider. This desig… view at source ↗

**Figure 2.** Figure 2: Circuit breaker state machine. The circuit opens on sustained errors, transitions to half-open after a cooldown, and closes on a successful probe request. tracted from API response bodies. At 85% utilisation, the agent receives a warning. At 100%, the agent is checkpointed (state saved to disk) and stopped, analogous to the OS OOM killer. 3.5 Priority Queue with Dependency DAG Tasks are ordered by: (1) p… view at source ↗

**Figure 3.** Figure 3: Failure rates by scenario. Direct mode (red) fails catastrophically at 10+ agents; HiveMind (green) reduces failures to 0–18%. At 5 agents, both modes succeed—there is no contention. At 10+ agents, uncoordinated execution fails catastrophically (72–100% failure rate), while HiveMind reduces failures to 0– 18%. The residual failures in replay-11 and stress scenarios arise from error injection rates that e… view at source ↗

**Figure 4.** Figure 4: Scaling behaviour. Left: agents that complete successfully. Right: effective throughput (tasks/min). Direct mode throughput drops to zero beyond 5 agents [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 6.** Figure 6: Wasted tokens by scenario (thousands). Direct mode (red) wastes 1–12K tokens per scenario; HiveMind (green) reduces waste to near-zero. by 96–97%. 6 Related Work Agent orchestration frameworks. LangChain [7], CrewAI [8], AutoGen [9], and Semantic Kernel [10] focus on agent composition— chains, crews, multi-agent conversations—but not on resource management. They assume the API is always available and del… view at source ↗

read the original abstract

When multiple LLM coding agents share a rate-limited API endpoint, they exhibit resource contention patterns analogous to unscheduled OS processes competing for CPU, memory, and I/O. In a motivating incident, 3 of 11 parallel agents died from connection resets and HTTP 502 errors - a 27% failure rate - despite the API having sufficient aggregate capacity to serve all 11 sequentially. We present HIVEMIND, a transparent HTTP proxy that applies five OS-inspired scheduling primitives - admission control, rate-limit tracking, AIMD backpressure with circuit breaking, token budget management, and priority queuing - to eliminate the failure modes caused by uncoordinated parallel execution. The proxy requires zero modifications to existing agent code and supports Anthropic, OpenAI, and local model APIs via auto-detected provider profiles. Our evaluation across seven scenarios (5-50 concurrent agents) shows that uncoordinated agents fail at 72-100% rates under contention, while HIVEMIND reduces failures to 0-18% and eliminates 48-100% of wasted compute. An ablation study reveals that transparent retry - not admission control - is the single most critical primitive, but the primitives are most effective in combination. Real-world validation against Ollama confirms that HIVEMIND adds under 3ms of proxy overhead per request. The system is open-source under the MIT license.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HiveMind applies OS scheduling primitives to LLM agent API contention via a transparent no-mod proxy, with clear practical gains in tested cases but thin details on scenario realism.

read the letter

The main thing to know is that HiveMind takes established operating system scheduling concepts and applies them to the problem of multiple LLM agents competing for rate-limited API access through a simple transparent proxy. This delivers measurable improvements in their tests, but the evaluation leaves some questions about real-world applicability. What stands out as new is the combination of five specific primitives—admission control, rate-limit tracking, AIMD backpressure with circuit breaking, token budget management, and priority queuing—implemented in a way that needs no changes to existing agent code. The support for different providers like Anthropic, OpenAI, and local models via auto-detection is a nice touch for usability. They also include an ablation that identifies transparent retry as the most important single piece while showing the value of using them together. The low overhead of under 3ms per request and the open-source release under MIT are practical pluses. The paper does well at highlighting the issue with the example of 11 agents where 3 failed despite enough capacity, and the quantitative results across scenarios with 5 to 50 agents show uncoordinated setups failing 72-100% of the time versus much lower with the system. The soft spots are around the evaluation details. The abstract does not describe exactly how the contention was generated in those seven scenarios, such as the specific rate limits used, how errors were injected, or the distribution of requests. Without that, it's difficult to assess whether the failure reductions would hold for actual production workloads on real APIs. The single motivating incident is too limited to support broad claims. If the tests use synthetic conditions that align closely with the proxy's design, the benefits might not generalize as strongly. This work is for practitioners and researchers focused on deploying concurrent LLM agents in environments where API limits cause problems. Someone looking for a ready-to-use tool or ideas for similar systems would find it useful. It deserves a serious referee because the contribution is grounded in real deployment pain and the implementation appears straightforward to verify. I would recommend sending it to peer review, with feedback to expand on the experimental setup and workload characteristics to strengthen the evidence.

Referee Report

3 major / 2 minor

Summary. The paper presents HiveMind, a transparent HTTP proxy implementing five OS-inspired scheduling primitives (admission control, rate-limit tracking, AIMD backpressure with circuit breaking, token budget management, and priority queuing) to mitigate resource contention when multiple LLM coding agents share rate-limited APIs from providers such as Anthropic, OpenAI, and Ollama. It reports a motivating incident with 27% failure rate among 11 agents and claims that across seven scenarios (5-50 concurrent agents), uncoordinated execution yields 72-100% failure rates while HiveMind reduces failures to 0-18%, eliminates 48-100% of wasted compute, adds under 3 ms overhead, and benefits most from the combined primitives per an ablation study. The system requires zero agent code changes and is released open-source under MIT.

Significance. If the evaluation holds and generalizes, this would be a practically significant systems contribution for concurrent LLM agent deployments, demonstrating a lightweight, provider-agnostic proxy that applies established OS scheduling ideas to reduce failures and wasted compute without modifying agent code. The low measured overhead, open-source release, and ablation identifying the value of combined primitives are strengths that could influence production tooling for multi-agent workflows.

major comments (3)

[Evaluation] Evaluation section: the central empirical claims (uncoordinated failure rates of 72-100% vs. HIVEMIND 0-18%, plus 48-100% wasted compute elimination) depend on the seven scenarios faithfully modeling real rate-limited API contention, but the manuscript provides no details on how contention was generated (exact rate limits, error injection, request size distributions, or whether agents were modified to trigger specific failure modes). This is load-bearing for assessing whether the measured improvements transfer beyond the synthetic setups to production Anthropic/OpenAI/Ollama behaviors.
[Abstract] Abstract and motivating incident: the single incident (3/11 agents failing) is cited to motivate the work, but with such a small sample it cannot robustly anchor the generalization claims made for the seven scenarios spanning 5-50 agents; the manuscript should quantify variability across runs or provide statistical significance for the failure reductions.
[Evaluation] Ablation study: the claim that transparent retry is the single most critical primitive (yet most effective in combination) is presented without reporting the per-primitive failure rates or wasted compute numbers from the ablation; this makes it impossible to verify the relative contributions and whether the combination effect is additive or synergistic.

minor comments (2)

[Abstract] The abstract states support for 'local model APIs via auto-detected provider profiles' but does not specify which local servers (beyond the Ollama overhead measurement) or how profile detection works; a brief description would improve clarity.
[Evaluation] The paper should include a table summarizing the seven scenarios (agent count, workload type, API provider, measured failure rates with and without HIVEMIND) to make the quantitative results easier to parse.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important areas for improving the clarity, reproducibility, and rigor of the evaluation. We address each major comment point by point below.

read point-by-point responses

Referee: [Evaluation] Evaluation section: the central empirical claims (uncoordinated failure rates of 72-100% vs. HIVEMIND 0-18%, plus 48-100% wasted compute elimination) depend on the seven scenarios faithfully modeling real rate-limited API contention, but the manuscript provides no details on how contention was generated (exact rate limits, error injection, request size distributions, or whether agents were modified to trigger specific failure modes). This is load-bearing for assessing whether the measured improvements transfer beyond the synthetic setups to production Anthropic/OpenAI/Ollama behaviors.

Authors: We agree that the current manuscript lacks sufficient detail on the experimental setup, which limits assessment of how faithfully the scenarios model production contention. In the revised version, we will expand the Evaluation section with a new subsection describing the exact rate-limit configurations applied for each provider profile, the request size and token distributions used to generate workloads, the process for inducing contention (including how API responses such as rate-limit errors were handled), and explicit confirmation that no modifications were made to any agent code. These additions will be based on the test harness used for the reported results and will directly address transferability concerns. revision: yes
Referee: [Abstract] Abstract and motivating incident: the single incident (3/11 agents failing) is cited to motivate the work, but with such a small sample it cannot robustly anchor the generalization claims made for the seven scenarios spanning 5-50 agents; the manuscript should quantify variability across runs or provide statistical significance for the failure reductions.

Authors: The single motivating incident is presented strictly as an anecdotal observation that prompted the work, not as statistical evidence supporting the main claims. The primary results and generalization rest on the seven controlled scenarios. We will revise the abstract, introduction, and Evaluation section to explicitly distinguish the incident from the empirical evaluation and will add measures of variability (such as standard deviation or min/max ranges across repeated runs) for the reported failure rates and wasted-compute figures to provide the requested statistical context. revision: partial
Referee: [Evaluation] Ablation study: the claim that transparent retry is the single most critical primitive (yet most effective in combination) is presented without reporting the per-primitive failure rates or wasted compute numbers from the ablation; this makes it impossible to verify the relative contributions and whether the combination effect is additive or synergistic.

Authors: We acknowledge that the ablation results are summarized at a high level without the underlying per-primitive metrics, making independent verification difficult. In the revision, we will add a detailed table (or expanded figure) in the Evaluation section that reports failure rates and wasted-compute percentages for each primitive in isolation as well as for all relevant combinations. This will allow readers to assess the relative contribution of transparent retry and to evaluate whether the combined effect is additive or synergistic. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems paper with no derivation chain

full rationale

The paper is a purely empirical systems contribution describing an HTTP proxy implementing five scheduling primitives, evaluated via seven scenarios and one motivating incident. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text. Claims rest on direct experimental measurements of failure rates and wasted compute rather than any reduction to prior results or definitions by construction. The work is self-contained with independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the contribution is an engineering system built on standard OS concepts and HTTP proxy techniques.

pith-pipeline@v0.9.0 · 5578 in / 961 out tokens · 28373 ms · 2026-05-10T06:08:42.774390+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Safe Bilevel Delegation (SBD): A Formal Framework for Runtime Delegation Safety in Multi-Agent Systems
cs.AI 2026-04 unverdicted novelty 6.0

SBD is a bilevel optimization framework that learns context-dependent safety weights for runtime task delegation in hierarchical multi-agent systems, with continuous authority transfer alpha and theoretical guarantees...

Reference graph

Works this paper leans on

23 extracted references · cited by 1 Pith paper

[1]

Claude code: An agentic cod- ing tool

Anthropic, “Claude code: An agentic cod- ing tool.”https://docs.anthropic.com/ en/docs/claude-code, 2025. Accessed: 2026-04-15

2025
[2]

Codex CLI: Open-source cod- ing agent

OpenAI, “Codex CLI: Open-source cod- ing agent.”https://github.com/openai/ codex, 2025. Accessed: 2026-04-15

2025
[3]

Cursor: The AI code edi- tor

Anysphere Inc., “Cursor: The AI code edi- tor.”https://cursor.com, 2024. Accessed: 2026-04-15

2024
[4]

GitHub copilot

GitHub, “GitHub copilot.”https: //github.com/features/copilot, 2024. Accessed: 2026-04-15. 11

2024
[5]

Silberschatz, P

A. Silberschatz, P. B. Galvin, and G. Gagne,Operating System Concepts. Wiley, 10th ed., 2018

2018
[6]

System deadlocks,

E. G. Coffman, M. J. Elphick, and A. Shoshani, “System deadlocks,”ACM Computing Surveys, vol. 3, no. 2, pp. 67– 78, 1971

1971
[7]

LangChain

H. Chase, “LangChain.”https://github. com/langchain-ai/langchain, 2022. Ac- cessed: 2026-04-15

2022
[8]

Moura, “CrewAI.”https://github

J. Moura, “CrewAI.”https://github. com/joaomdmoura/crewAI, 2024. Accessed: 2026-04-15

2024
[9]

AutoGen: Enabling next- gen LLM applications via multi-agent con- versation,

Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E.Zhu, L.Jiang, X.Zhang, S.Zhang, J.Liu, A. H. Awadallah, R. W. White, D. Burger, and C. Wang, “AutoGen: Enabling next- gen LLM applications via multi-agent con- versation,” 2023

2023
[10]

Semantic kernel

Microsoft, “Semantic kernel.”https: //github.com/microsoft/semantic- kernel, 2023. Accessed: 2026-04-15

2023
[11]

Devin: The first AI soft- ware engineer

Cognition Labs, “Devin: The first AI soft- ware engineer.”https://devin.ai, 2024. Accessed: 2026-04-15

2024
[12]

SWE-bench: Can language models resolve real-world GitHub issues?,

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “SWE-bench: Can language models resolve real-world GitHub issues?,” 2024

2024
[13]

SWE-agent: Agent-computer interfaces enable automated software engineering,

J. Yang, C. E. Jimenez, A. Wettig, K. Liber, S. Yao, K. Narasimhan, and O. Press, “SWE-agent: Agent-computer interfaces enable automated software engineering,” 2024

2024
[14]

A. S. Tanenbaum and H. Bos,Modern Op- erating Systems. Pearson, 4th ed., 2015

2015
[15]

Cooperating sequential processes,

E. W. Dijkstra, “Cooperating sequential processes,”Technical Report EWD-123, Technological University Eindhoven, 1965

1965
[16]

Congestion avoidance and control,

V. Jacobson, “Congestion avoidance and control,” inProceedings of ACM SIG- COMM, pp. 314–329, 1988

1988
[17]

Analysis of the increase and decrease algorithms for con- gestion avoidance in computer networks,

D.-M. Chiu and R. Jain, “Analysis of the increase and decrease algorithms for con- gestion avoidance in computer networks,” Computer Networks and ISDN Systems, vol. 17, no. 1, pp. 1–14, 1989

1989
[18]

M. T. Nygard,Release It! Design and De- ploy Production-Ready Software. Pragmatic Bookshelf, 2nd ed., 2018

2018
[19]

The tail at scale,

J. Dean and L. A. Barroso, “The tail at scale,” inCommunications of the ACM, vol. 56, pp. 74–80, 2013

2013
[20]

Ollama: Run large language mod- elslocally

Ollama, “Ollama: Run large language mod- elslocally.”https://ollama.com, 2024. Ac- cessed: 2026-04-15

2024
[21]

TCP congestion control,

M. Allman, V. Paxson, and E. Blan- ton, “TCP congestion control,”RFC 5681, IETF, 2009

2009
[22]

SEDA: An architecture for well- conditioned, scalable internet services,

M. Welsh, D. Culler, and E. Brewer, “SEDA: An architecture for well- conditioned, scalable internet services,” in Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP), pp. 230–243, 2001

2001
[23]

Herlihy and N

M. Herlihy and N. Shavit,The Art of Mul- tiprocessor Programming. Morgan Kauf- mann, revised 1st ed., 2012. 12

2012