Recognition: 2 theorem links
The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
Pith reviewed 2026-05-15 06:36 UTC · model grok-4.3
The pith
The Workload-Router-Pool architecture organizes LLM inference optimization into a 3x3 interaction matrix that maps prior results onto covered and open cells and proposes twenty-one concrete research directions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Workload-Router-Pool architecture is a three-dimensional framework for LLM inference optimization. Workload characterizes what the fleet serves, Router determines how each request is dispatched, and Pool defines where inference runs. Mapping prior work onto the 3x3 interaction matrix identifies covered cells and open cells, and the paper proposes twenty-one concrete research directions at the intersections, each grounded in prior measurements and tiered by maturity.
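As a rough illustration of the mapping exercise, the 3x3 interaction matrix and its covered/open cells can be sketched as a small lookup table. The coverage entries below are placeholders, not the paper's actual cell assignments:

```python
from itertools import product

# The three WRP dimensions; the 3x3 interaction matrix pairs each
# dimension with every dimension (diagonal cells are self-interactions).
DIMS = ["Workload", "Router", "Pool"]

# Illustrative coverage set: which (row, col) cells prior work touches.
# These entries are hypothetical, not the paper's actual mapping.
covered = {("Workload", "Router"), ("Router", "Pool"), ("Workload", "Pool")}

matrix = {
    (row, col): ("covered" if (row, col) in covered else "open")
    for row, col in product(DIMS, DIMS)
}

open_cells = [cell for cell, status in matrix.items() if status == "open"]
print(f"{len(matrix)} cells, {len(open_cells)} open")  # 9 cells, 6 open
```

The open cells are exactly where the paper locates its proposed research directions.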
What carries the argument
The Workload-Router-Pool (WRP) architecture, a three-dimensional framework that places workload types, routing policies, and execution pools on the axes of a 3x3 matrix to expose covered areas and open research cells.
If this is right
- Fleet provisioning decisions must change when routing policies shift or when workload mixes move toward agentic and multimodal traffic.
- Safety mechanisms such as policy conflict detection and hallucination checks become more effective when combined with context-length-aware pool routing.
- Energy-efficiency gains require joint selection of router policies and heterogeneous pool configurations rather than independent tuning.
- Agentic workloads with multi-turn memory and tool selection create new cells in the matrix that need dedicated routing and pool designs.
- Standards for inference routing protocols and multi-provider APIs must account for interactions across all three WRP dimensions.
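The energy-efficiency bullet's claim that joint selection can beat independent per-axis tuning is easy to illustrate with a toy grid. All numbers below are made up, with one deliberate interaction term between router and pool:

```python
# Hypothetical energy cost for each (router policy, pool config) pair.
# The values are invented for the sketch; the bad "bandit x heterogeneous"
# pairing models an interaction that independent tuning cannot see.
energy = {
    ("semantic_rules", "homogeneous"): 10.0,
    ("semantic_rules", "heterogeneous"): 9.0,
    ("bandit", "homogeneous"): 8.0,
    ("bandit", "heterogeneous"): 12.0,
}

routers = {"semantic_rules", "bandit"}
pools = {"homogeneous", "heterogeneous"}

# Independent tuning: pick each axis by its marginal (summed) energy.
best_router = min(routers, key=lambda r: sum(energy[r, p] for p in pools))
best_pool = min(pools, key=lambda p: sum(energy[r, p] for r in routers))
independent = energy[best_router, best_pool]

# Joint tuning: search the full grid.
joint = min(energy.values())

print(independent, joint)  # independent tuning misses the optimum
```

Here the marginals favor semantic rules on a homogeneous pool (10.0), while the joint search finds the bandit-homogeneous pair (8.0): the interaction term is invisible to per-axis tuning.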
Where Pith is reading between the lines
- If the matrix proves stable across new model families, it could serve as a shared taxonomy for comparing commercial serving systems.
- Extending the framework with a fourth axis for network topology might be needed if cross-node KV-cache movement dominates latency.
- Prioritizing the engineering-ready directions first would let teams measure concrete throughput or cost improvements before tackling open research cells.
- The same matrix structure could be tested on non-LLM workloads such as diffusion model serving to check whether the three dimensions generalize.
Load-bearing premise
The three dimensions of workload, router, and pool are sufficient to capture the main interactions that matter for LLM inference optimization.
What would settle it
Demonstration of a dominant factor, such as regulatory data-locality rules or network-topology effects, that cannot be assigned to any cell in the 3x3 WRP matrix would show the framework is incomplete.
Original abstract
Over the past year, the vLLM Semantic Router project has released a series of work spanning: (1) core routing mechanisms -- signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing -- multimodal agent routing, tool selection, CUA security, and multi-turn context memory and safety; (4) governance and standards -- inference routing protocols and multi-provider API extensions. Each paper tackled a specific problem in LLM inference, but the problems are not independent; for example, fleet provisioning depends on the routing policy, which depends on the workload mix, shifting as organizations adopt agentic and multimodal workloads. This paper distills those results into the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. Workload characterizes what the fleet serves (chat vs. agent, single-turn vs. multi-turn, warm vs. cold, prefill-heavy vs. decode-heavy). Router determines how each request is dispatched (static semantic rules, online bandit adaptation, RL-based model selection, quality-aware cascading). Pool defines where inference runs (homogeneous vs. heterogeneous GPU, disaggregated prefill/decode, KV-cache topology). We map our prior work onto a 3x3 WRP interaction matrix, identify which cells we have covered and which remain open, and propose twenty-one concrete research directions at the intersections, each grounded in our prior measurements, tiered by maturity from engineering-ready to open research.
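To make the three dimensions concrete, here is a minimal sketch of a static semantic-rule router in the abstract's vocabulary. The workload attributes, thresholds, and pool names are illustrative assumptions, not the project's actual policy:

```python
from dataclasses import dataclass

@dataclass
class Request:
    # Workload attributes named in the abstract; field names are our own.
    is_agentic: bool       # chat vs. agent
    multi_turn: bool       # single-turn vs. multi-turn
    prefill_heavy: bool    # prefill-heavy vs. decode-heavy
    context_tokens: int

def route(req: Request) -> str:
    """Toy static semantic-rule router: pick an execution pool from
    workload attributes. Thresholds and pool names are hypothetical."""
    if req.prefill_heavy and req.context_tokens > 32_000:
        return "disaggregated_prefill_decode"  # long-context prefill
    if req.is_agentic and req.multi_turn:
        return "heterogeneous_gpu"             # keep session KV cache warm
    return "homogeneous_gpu"                   # default chat traffic

print(route(Request(False, False, True, 64_000)))  # disaggregated_prefill_decode
```

A bandit or RL-based router would replace the fixed rules with a learned policy over the same (workload, pool) space.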
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Workload-Router-Pool (WRP) architecture as a three-dimensional framework for LLM inference optimization. It distills prior vLLM Semantic Router results into characterizations of Workload (chat vs. agentic, single- vs. multi-turn, prefill- vs. decode-heavy), Router (semantic rules, bandit adaptation, RL selection), and Pool (homogeneous/heterogeneous GPUs, disaggregated prefill/decode, KV-cache topology), maps these onto a 3x3 interaction matrix to identify covered and open cells, and proposes 21 concrete research directions grounded in the project's prior measurements and tiered by maturity.
Significance. If the framework holds, the WRP matrix could serve as a useful organizing taxonomy for LLM inference research, systematically highlighting gaps at the intersections of workload characteristics, routing policies, and execution pools. The explicit grounding of the 21 directions in existing measurements from the vLLM project is a strength that could help prioritize engineering-ready versus open-research items.
Major comments (2)
- [WRP architecture definition and matrix construction] Section defining the WRP dimensions and 3x3 matrix: The central claim that these three axes fully capture key interactions and allow complete mapping of prior work rests on the unargued assumption that factors such as network topology (e.g., KV-cache transfer latency across nodes) or regulatory constraints (data residency) can be reduced to Workload/Router/Pool without loss of fidelity. No explicit justification or reduction argument is provided, which directly affects the identification of 'open cells' and the completeness of the proposed directions.
- [Proposal of twenty-one research directions] Section on the 21 research directions: Several directions (particularly those involving fleet governance and multi-provider standards) are listed without explicit mapping back to specific open cells in the 3x3 matrix or to concrete prior measurements that would ground their feasibility. This weakens the claim that the directions are systematically derived from the matrix analysis.
Minor comments (2)
- [Abstract and introduction] The abstract and introduction could more explicitly separate the synthesis of previously published vLLM results from any novel conceptual contribution of the WRP framing itself.
- [Matrix and directions section] Notation for the matrix cells and direction tiers would benefit from a small summary table to improve readability when the 21 directions are enumerated.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and grounding of the WRP framework. We address both major comments below and will revise the manuscript to strengthen the justification of the dimensions and the explicit mapping of research directions.
Point-by-point responses
- Referee: [WRP architecture definition and matrix construction] Section defining the WRP dimensions and 3x3 matrix: The central claim that these three axes fully capture key interactions and allow complete mapping of prior work rests on the unargued assumption that factors such as network topology (e.g., KV-cache transfer latency across nodes) or regulatory constraints (data residency) can be reduced to Workload/Router/Pool without loss of fidelity. No explicit justification or reduction argument is provided, which directly affects the identification of 'open cells' and the completeness of the proposed directions.
Authors: We acknowledge the absence of an explicit reduction argument. In revision we will insert a dedicated paragraph in Section 2 explaining the scope: network topology effects are already subsumed under the Pool dimension through the KV-cache topology sub-axis (our prior disaggregated prefill/decode measurements quantify cross-node transfer latencies), while regulatory constraints such as data residency are treated as workload attributes (privacy-sensitive vs. general chat) or pool restrictions (geo-fenced GPU sets). We will also state explicitly that WRP is offered as a practical organizing taxonomy derived from vLLM measurements rather than a claim of theoretical completeness, and we will note possible extensions for factors outside the current axes. This addition will make the identification of open cells more transparent. Revision: yes.
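The rebuttal's move of treating data residency as a pool restriction can be sketched as a pre-routing filter over geo-fenced GPU sets. The pool names, regions, and the `eligible_pools` helper are hypothetical:

```python
from typing import Optional

# Hypothetical pool inventory mapping pool name -> hosting region.
POOLS = {
    "us-east-h100": "us",
    "eu-west-a100": "eu",
    "ap-south-l4": "ap",
}

def eligible_pools(required_region: Optional[str]) -> list[str]:
    """Geo-fence: a privacy-sensitive workload attribute pins the request
    to pools in its required region; unrestricted requests may use any
    pool. The router then scores only the surviving candidates."""
    if required_region is None:
        return list(POOLS)
    return [p for p, region in POOLS.items() if region == required_region]

print(eligible_pools("eu"))  # only the EU pool survives the fence
```

Under this framing, the regulatory constraint never needs its own axis: it shrinks the Pool dimension before any Workload-Router interaction is evaluated.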
- Referee: [Proposal of twenty-one research directions] Section on the 21 research directions: Several directions (particularly those involving fleet governance and multi-provider standards) are listed without explicit mapping back to specific open cells in the 3x3 matrix or to concrete prior measurements that would ground their feasibility. This weakens the claim that the directions are systematically derived from the matrix analysis.
Authors: We agree that the linkage should be stated more explicitly. The revised manuscript will include a summary table (new Table 3) that, for each of the 21 directions, lists the target (Workload, Router, Pool) cell and cites the specific prior vLLM measurement or paper that grounds its feasibility. For the fleet-governance and multi-provider directions, we will map them to the open cells at the intersection of heterogeneous pools with adaptive routers and reference the energy-efficiency analysis and multi-provider API extension results already obtained in the project. This will demonstrate the systematic derivation from the matrix. Revision: yes.
Circularity Check
WRP 3x3 matrix and 21 directions reduce to re-labeling of authors' own prior vLLM results
Specific steps
- Renaming a known result [Abstract]: "This paper distills those results into the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. ... We map our prior work onto a 3x3 WRP interaction matrix, identify which cells we have covered and which remain open, and propose twenty-one concrete research directions at the intersections, each grounded in our prior measurements"
The 3x3 matrix is populated exclusively by re-mapping the authors' own prior vLLM papers (listed in the abstract as the source of the distillation); the identification of covered cells and the 21 directions are therefore direct outputs of that re-mapping rather than new derivations.
- Load-bearing self-citation [Abstract]: "Over the past year, the vLLM Semantic Router project has released a series of work spanning: (1) core routing mechanisms -- signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing -- multimodal agent routing, tool selec"
The claim that WRP is a sufficient three-dimensional framework rests on the completeness of the enumerated self-cited project outputs; no external criterion or reduction is provided to show why network topology, regulatory constraints, or other factors can be omitted without loss.
Full rationale
The paper's central derivation consists of defining the WRP dimensions from the authors' listed prior publications, then mapping those same publications onto the new 3x3 matrix to identify covered/open cells and generate 21 directions. This process is self-contained within the authors' body of work with no external derivation, benchmark, or independent validation step shown; the 'framework' and proposals are therefore equivalent to a reorganization of the input citations by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The three dimensions of Workload, Router, and Pool are sufficient to capture all key interactions in LLM inference optimization.
Invented entities (1)
- Workload-Router-Pool (WRP) architecture (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
Unclear relation between the paper passage and the cited Recognition theorem. Passage: "The Workload–Router–Pool (WRP) architecture, a three-dimensional framework... Workload characterizes what the fleet serves... Router determines how each request is dispatched... Pool defines where inference runs"
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
Unclear relation between the paper passage and the cited Recognition theorem. Passage: "optimization objective is a weighted combination of cost (GPU-hours per request), accuracy, latency, and energy"
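The quoted objective admits a minimal sketch as a scalar weighted sum. The weights, metric names, and sign convention (lower is better, so accuracy is negated) are our assumptions, not the paper's formulation:

```python
# Toy scalarization of the quoted objective: cost (GPU-hours per request),
# accuracy, latency, and energy. Weights are arbitrary placeholders.
def objective(cost_gpu_hours: float, accuracy: float,
              latency_s: float, energy_kwh: float,
              w_cost=1.0, w_acc=2.0, w_lat=0.5, w_energy=0.25) -> float:
    # Lower is better: accuracy enters with a negative sign so that
    # higher accuracy reduces the objective.
    return (w_cost * cost_gpu_hours
            - w_acc * accuracy
            + w_lat * latency_s
            + w_energy * energy_kwh)

# Comparing two hypothetical (router, pool) configurations:
a = objective(cost_gpu_hours=0.8, accuracy=0.92, latency_s=1.2, energy_kwh=0.3)
b = objective(cost_gpu_hours=0.5, accuracy=0.85, latency_s=2.0, energy_kwh=0.2)
print(min(("config_a", a), ("config_b", b), key=lambda t: t[1])[0])  # config_a
```

Any router/pool co-selection scheme then reduces to minimizing this scalar over the candidate configurations, with the weight choice encoding the fleet operator's trade-offs.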
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] vLLM Semantic Router Team. vLLM semantic router: Signal driven decision routing for mixture-of-modality models. arXiv preprint arXiv:2603.04444, 2026.
- [2] Xunzhuo Liu, Hao Wu, Huamin Chen, Bowei He, and Xue Liu. Conflict-free policy languages for probabilistic ML predicates: A framework and case study with the semantic router DSL. arXiv preprint arXiv:2603.18174, 2026.
- [3] Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, and Huamin Chen. 98× faster LLM routing without a dedicated GPU: Flash attention, prompt compression, and near-streaming for the vLLM semantic router. arXiv preprint arXiv:2603.12646, 2026.
- [4] vLLM Semantic Router Team. mmBERT-embed-32k-2d-matryoshka: Multilingual embedding model with 2d matryoshka training. Hugging Face model: llm-semantic-router/mmbert-embed-32k-2d-matryoshka, 2025.
- [5] Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, and Huamin Chen. When to reason: Semantic router for vLLM. In NeurIPS Workshop on ML for Systems (MLForSys), 2025.
- [6] Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, and Huamin Chen. Category-aware semantic caching for heterogeneous LLM workloads. arXiv preprint arXiv:2510.26835, 2025.
- [7] vLLM Semantic Router Team. mmBERT-32k feedback detector: User satisfaction classification for online routing adaptation. Hugging Face model: llm-semantic-router/mmbert32k-feedback-detector-lora, 2026.
- [8] vLLM Semantic Router Team. Token-level truth: Real-time hallucination detection for production LLMs. vLLM Blog, 2025. https://blog.vllm.ai/2025/12/14/halugate.html
- [9] vLLM Semantic Router Team. mmBERT-32k factcheck classifier: Binary prompt classification for conditional hallucination detection. Hugging Face model: llm-semantic-router/mmbert32k-factcheck-classifier-merged, 2026.
- [10] vLLM Semantic Router Team. MLCommons AI safety classifier – level 1 (binary): Safe vs. unsafe content classification. Hugging Face model: llm-semantic-router/mlcommons-safety-classifier-level1-binary, 2026.
- [11] vLLM Semantic Router Team. MLCommons AI safety classifier – level 2 (9-class hazard): Hierarchical content safety classification. Hugging Face model: llm-semantic-router/mlcommons-safety-classifier-level2-hazard, 2026.
- [12] Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, and Xue Liu. Token-budget-aware pool routing for cost-efficient LLM inference. arXiv preprint, 2026.
- [13] Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, and Xue Liu. FleetOpt: Analytical fleet provisioning for LLM inference with compress-and-route as implementation mechanism. arXiv preprint arXiv:2603.16514, 2026.
- [14] Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, and Xue Liu. inference-fleet-sim: A queueing-theory-grounded fleet capacity planner for LLM inference. arXiv preprint arXiv:2603.16054, 2026.
- [15] Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, and Xue Liu. The 1/W law: An analytical study of context-length routing topology and GPU generation gains for LLM inference energy efficiency. arXiv preprint arXiv:2603.17280, 2026.
- [16] Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, and Huamin Chen. Adaptive vision-language model routing for computer use agents. arXiv preprint arXiv:2603.12823, 2026.
- [17] Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, and Xue Liu. Outcome-aware tool selection for semantic routers: Latency-constrained learning without LLM inference. arXiv preprint arXiv:2603.13426, 2026.
- [18] Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, and Huamin Chen. Visual confused deputy: Exploiting and defending perception failures in computer-using agents. arXiv preprint arXiv:2603.14707, 2026.
- [19] OpenClaw contributors. OpenClaw: Personal AI assistant with a local-first gateway. Open-source software (MIT License), 2026. Repository: https://github.com/openclaw/openclaw; documentation at https://docs.openclaw.ai
- [20] Huamin Chen and Luay Jalil. Semantic inference routing protocol (SIRP). Internet Engineering Task Force (IETF), 2025.
- [21] Huamin Chen, Luay Jalil, and N. Cocker. Multi-provider extensions for agentic AI inference APIs. Internet Engineering Task Force (IETF), Network Management Research Group, 2025.
- [22] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, et al. LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset. In Proceedings of ICLR, 2024.
- [23] Pratyush Patel, Esha Choukse, Chaojie Zhang, et al. Splitwise: Efficient generative LLM inference using phase splitting. In Proceedings of ISCA, 2024.
- [24] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In Proceedings of ICLR, 2024.
- [25] Shishir G. Patil, Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Ion Stoica, and Joseph E. Gonzalez. The Berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Proceedings of ICML, 2025.
- [26] Yuxing Xiang, Xue Li, Kun Qian, Wenyuan Yu, Ennan Zhai, and Xin Jin. ServeGen: Workload characterization and generation of large language model serving in production. arXiv preprint arXiv:2505.09999, 2025.
- [27] Han Bao, Zheyuan Zhang, Pengcheng Jing, et al. Drift-bench: Diagnosing cooperative breakdowns in LLM agents under input faults via multi-turn interaction. arXiv preprint arXiv:2602.02455, 2026.
- [28] Xuannan Liu, Xiao Yang, Zekun Li, et al. AgentHallu: Benchmarking automated hallucination attribution of LLM-based agents. arXiv preprint arXiv:2601.06818, 2026.
- [29] Pengfei Du. Memory for autonomous LLM agents: Mechanisms, evaluation, and emerging frontiers. arXiv preprint arXiv:2603.07670, 2026.
- [30] Natchanon Pollertlam and Witchayut Kornsuwannawit. Beyond the context window: A cost-performance analysis of fact-based memory vs. long-context LLMs for persistent agents. arXiv preprint arXiv:2603.04814, 2026.
- [31] Minki Kang, Wei-Ning Chen, Dongge Han, et al. ACON: Optimizing context compression for long-horizon LLM agents. arXiv preprint arXiv:2510.00615, 2025.
- [32] Nikhil Verma. Active context compression: Autonomous memory management in LLM agents. arXiv preprint arXiv:2601.07190, 2026.
- [33] Yubin Ge, Salvatore Romeo, Jason Cai, et al. SAMULE: Self-learning agents enhanced by multi-level reflection. In Proceedings of EMNLP, 2025. arXiv:2509.20562.
- [34] Yifan Yu, Moyan Li, Shaoyuan Xu, et al. CORRECT: COndensed eRror RECognition via knowledge transfer in multi-agent systems. arXiv preprint arXiv:2509.24088, 2025.
- [35] Xuanbo Su, Yingfang Zhang, Hao Luo, et al. Mistake notebook learning: Batch-clustered failures for training-free agent adaptation. arXiv preprint arXiv:2512.11485, 2025.
- [36] Uria Franko. Dynamic system instructions and tool exposure for efficient agentic LLMs. arXiv preprint arXiv:2602.17046, 2026.
- [37] Marianne Menglin Liu, Daniel Garcia, Fjona Parllaku, Vikas Upadhyay, Syed Fahad Allam Shah, and Dan Roth. ToolScope: Enhancing LLM agent tool use through tool merging and context-aware filtering. arXiv preprint arXiv:2510.20036, 2025.
- [38] Cheng Qian, Emre Can Acikgoz, Hongru Wang, et al. SMART: Self-aware agent for tool overuse mitigation. arXiv preprint arXiv:2502.11435, 2025.
- [39] Tengxiao Liu, Zifeng Wang, Jin Miao, et al. Budget-aware tool-use enables effective agent scaling. arXiv preprint arXiv:2511.17006, 2025.
- [40] Yanyu Ren, Li Chen, Dan Li, et al. Transcending cost-quality tradeoff in agent serving via session-awareness. In NeurIPS, 2025.
- [41] Hanchen Li et al. Continuum: Efficient and robust multi-turn LLM agent scheduling with KV cache time-to-live. arXiv preprint arXiv:2511.02230, 2025.
- [42] llm-d Team. KV-Cache wins you can see: From prefix caching in vLLM to distributed scheduling with llm-d. https://llm-d.ai/blog/kvcache-wins-you-can-see, 2026.
- [43] vLLM Community. RFC: Context-aware KV-cache retention API (prioritized evictions). https://github.com/vllm-project/vllm/issues/37003, 2026.
- [44] Yinmin Zhong, Shengyu Liu, Junda Chen, et al. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In Proceedings of OSDI, 2024.
- [45] BerriAI. LiteLLM tool permission guardrail. https://docs.litellm.ai/docs/proxy/guardrails/tool_permission, 2026.
- [46] AAP Protocol Working Group. Agent authorization profile (AAP): OAuth 2.0 extension for agent authorization. https://www.aap-protocol.org/, 2026.
- [47] Scott Rose, Oliver Borchert, Stu Mitchell, and Sean Connelly. Zero trust architecture. Technical Report SP 800-207, National Institute of Standards and Technology, 2020.
- [48] Ryan Marinelli, Josef Pichlmeier, and Tamas Bisztray. Harnessing chain-of-thought metadata for task routing and adversarial prompt detection. arXiv preprint arXiv:2503.21464, 2026.
- [49] Yinwei Dai, Zhuofu Chen, Anand Iyer, et al. Aragog: Just-in-time model routing for scalable serving of agentic workflows. arXiv preprint arXiv:2511.20975, 2025.
- [50] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, et al. RouteLLM: Learning to route LLMs with preference data. In Proceedings of ICLR, 2025.
- [51] Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, and Haifeng Chen. MixLLM: Dynamic routing in mixed large language models. In Proceedings of NAACL, 2025. arXiv:2502.18482.
- [52] Shashwat Jaiswal, Kunal Jain, Yogesh Simmhan, et al. SageServe: Optimizing LLM serving on cloud data centers with forecast aware auto-scaling. arXiv preprint arXiv:2502.14617, 2025.
- [53] Haozhen Zhang, Tao Feng, and Jiaxuan You. Router-R1: Teaching LLMs multi-round routing and aggregation via reinforcement learning. In Proceedings of NeurIPS, 2025. arXiv:2506.09033.
- [54] Jiaqi Xue, Qian Lou, Jiarong Xing, and Heng Huang. R2-Router: A new paradigm for LLM routing with reasoning. arXiv preprint arXiv:2602.02823, 2026.
- [55] Xianzhi Zhang, Yue Xu, Yinlin Zhu, Di Wu, Yipeng Zhou, Miao Hu, and Guocong Quan. Adapter-augmented bandits for online multi-constrained multi-modal inference scheduling. arXiv preprint arXiv:2603.06403, 2026.
- [56] Noppanat Wadlom, Junyi Shen, and Yao Lu. Efficient LLM serving for agentic workflows: A data systems perspective. arXiv preprint arXiv:2603.16104, 2026.
- [57] Anish Biswas, Kanishk Goel, Jayashree Mohan, et al. Sutradhara: An intelligent orchestrator-engine co-design for tool-based agentic inference. arXiv preprint arXiv:2601.12967, 2026.
- [58] Qiaoling Chen, Zhisheng Ye, Tian Tang, et al. CONCUR: High-throughput agentic batch inference of LLM via congestion-based concurrency control. arXiv preprint arXiv:2601.22705.
- [59] Guibin Zhang, Haiyang Yu, Kaiming Yang, et al. EvoRoute: Experience-driven self-routing LLM agent systems. arXiv preprint arXiv:2601.02695, 2026.
- [60] Caiqi Zhang, Menglin Xia, Xuchao Zhang, Daniel Madrigal, Ankur Mallick, Samuel Kessler, Victor Ruehle, and Saravan Rajmohan. Budget-aware agentic routing via boundary-guided training. arXiv preprint arXiv:2602.21227, 2026.
- [61] Elias Lumer, Faheem Nizar, Akshaya Jangiti, et al. Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks. arXiv preprint arXiv:2601.06007, 2026.
- [62] Cheng Qian, Zuxin Liu, Shirley Kokane, et al. xRouter: Training cost-aware LLMs orchestration system via reinforcement learning. arXiv preprint arXiv:2510.08439, 2025.
- [63] Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, and Ion Stoica. Mélange: Cost efficient large language model serving by exploiting GPU heterogeneity. arXiv preprint arXiv:2404.14527, 2024.
- [64] Ruoyu Qin et al. Mooncake: A KVCache-centric disaggregated architecture for LLM serving. arXiv preprint arXiv:2407.00079, 2025.
- [65] NVIDIA. NVIDIA Dynamo: Smart multi-node scheduling for LLM inference. https://developer.nvidia.com/blog/smart-multi-node-scheduling-for-fast-and-efficient-llm-inference-with-nvidia-runai-and-nvidia-dynamo/, 2026.
- [66] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, et al. Efficient memory management for large language model serving with PagedAttention. In Proceedings of SOSP, 2023.
- [67] Bowei He, Minda Hu, Zenan Xu, Hongru Wang, Licheng Zong, Yankai Chen, Chen Ma, Xue Liu, Pluto Zhou, and Irwin King. Search-R2: Enhancing search-integrated reasoning via actor–refiner collaboration. arXiv preprint arXiv:2602.03647, 2026.
- [68] Jesse van Remmerden, Zaharah Bukhsh, and Yingqian Zhang. Generalizing beyond suboptimality: Offline reinforcement learning learns effective scheduling through random data. arXiv preprint arXiv:2509.10303, 2025.
- [69] Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, et al. Agents of chaos: Evaluating LLM agent vulnerabilities through real-world interactions. arXiv preprint arXiv:2602.20021, 2026.
- [70] Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo multi-turn LLM jailbreak attack. arXiv preprint arXiv:2404.01833, 2024. USENIX Security 2025.
- [71] J. Alex Corll. Peak+accumulation: A proxy-level scoring formula for multi-turn LLM attack detection. arXiv preprint arXiv:2602.11247, 2026.
- [72] Justin Albrethsen, Yash Datta, Kunal Kumar, et al. DeepContext: Stateful real-time detection of multi-turn adversarial intent drift in LLMs. arXiv preprint arXiv:2602.16935, 2026.
- [73] Shir Ashury-Tahan, Yifan Mai, Elron Bandel, et al. ErrorMap and ErrorAtlas: Charting the failure landscape of large language models. arXiv preprint arXiv:2601.15812, 2026.
- [74] Hao Li, Yiqun Zhang, Zhaoyan Guo, et al. LLMRouterBench: A massive benchmark and unified framework for LLM routing. arXiv preprint arXiv:2601.07206, 2026.
- [75] Mehil B. Shah, Mohammad Mehdi Morovati, Mohammad Masudur Rahman, et al. Characterizing faults in agentic AI: A taxonomy of types, symptoms, and root causes. arXiv preprint arXiv:2603.06847, 2026.
- [76] Sri Vatsa Vuddanti, Aarav Shah, Satwik Kumar Chittiprolu, Tony Song, Sunishchal Dev, Kevin Zhu, Sean O'Brien, and Maheep Chaudhary. PALADIN: Self-correcting language model agents to cure tool-failure cases. arXiv preprint arXiv:2509.25238, 2025.
- [77]
- [78] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
- [79] Sakshi Choudhary, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Matthew Trager, Wei Xia, and Stefano Soatto. Learning when to attend: Conditional memory access for long-context LLMs. arXiv preprint arXiv:2603.17484, 2026.
- [80] Zuhair Ahmed Khan Taha, Mohammed Mudassir Uddin, and Shahnawaz Alam. AgentCompress: Task-aware compression for affordable large language model agents. arXiv preprint arXiv:2601.05191, 2026.