{"total":15,"items":[{"citing_arxiv_id":"2605.23389","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AlignedServe: Orchestrating Prefix-aware Batching to Build a High-throughput and Computing-efficient LLM Serving System","primary_cat":"cs.DC","submitted_at":"2026-05-22T09:00:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AlignedServe uses prefix-aware batching, large CPU in-flight request pools, batch scheduling, and GPU-to-GPU KV prefetching to raise decoding throughput up to 1.98x and cut latency up to 7.4x versus prior serving systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17410","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design","primary_cat":"cs.AI","submitted_at":"2026-05-17T12:11:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper defines Computational Token Economics and introduces the Token Economics Trilemma as a framework for trade-offs in granularity, real-time performance, and optimality, while outlining a research agenda for three challenge areas.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14249","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-14T01:37:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EnergyLens predicts multi-GPU LLM inference energy consumption with 9-13% MAPE and identifies configurations with up 
to 52x energy efficiency differences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11999","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures","primary_cat":"cs.DC","submitted_at":"2026-05-12T11:48:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"agentic multi-turn, streaming), the GPU spends most of its time in a low-power, memory-bound state that never reaches the cap. In our measurements, decode draws 160-300 W on a 700 W GPU; a facility-level 280 W cap achieves precisely nothing. The fix is straightforward: replace power capping with static SM clock locking for decode pools. In disaggregated serving architectures (Splitwise [17], DistServe [25]), where prefill and decode run on separate GPU pools, each pool can be locked at its phase-optimal clock - no dynamic switching required. For colocated serving (e.g. single-GPU vLLM), a conservative decode clock (780 MHz) applied globally saves 47-90 W per GPU at short-to-moderate context with negligible throughput loss. 
At data-centre scale (tens of thousands of GPUs),"},{"citing_arxiv_id":"2605.11333","ref_index":105,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces","primary_cat":"cs.DC","submitted_at":"2026-05-11T23:38:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Chakra introduces a standardized graph-based execution trace representation for distributed ML workloads along with supporting tools to enable benchmarking, analysis, generation, and co-design across simulators and hardware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07985","ref_index":42,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation","primary_cat":"cs.DC","submitted_at":"2026-05-08T16:44:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dooly reduces LLM inference profiling GPU-hours by 56.4% across 12 models while keeping simulation MAPE under 5% for TTFT and 8% for TPOT by making profiling configuration-agnostic and redundancy-aware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06113","ref_index":33,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale","primary_cat":"cs.DC","submitted_at":"2026-05-07T12:25:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BalanceRoute uses a piecewise-linear F-score (with optional short lookahead) for sticky request routing in LLM serving, 
reducing DP imbalance and raising end-to-end throughput versus vLLM baselines on production and Azure traces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04595","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints","primary_cat":"cs.LG","submitted_at":"2026-05-06T07:42:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02821","ref_index":15,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Is the Same Model Not the Same Service? 
A Measurement Study of Hosted Open-Weight LLM APIs","primary_cat":"cs.PF","submitted_at":"2026-05-04T16:59:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and throughput gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16007","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs","primary_cat":"cs.AR","submitted_at":"2026-04-17T12:29:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.01785","ref_index":113,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding","primary_cat":"cs.CL","submitted_at":"2026-02-02T08:10:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under 
compression.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.09427","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators","primary_cat":"cs.AR","submitted_at":"2025-12-10T08:52:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ODMA raises KV-cache utilization by up to 19.25% and throughput by 23-27% on Cambricon MLU accelerators by dynamically adjusting prediction buckets and using a safety pool for LLM serving.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.09999","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production","primary_cat":"cs.DC","submitted_at":"2025-05-15T06:24:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ServeGen characterizes production LLM inference workloads across model types and generates realistic per-client composed workloads that reduce under-provisioning by 50% in a production validation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.19256","ref_index":105,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HybridFlow: A Flexible and Efficient RLHF Framework","primary_cat":"cs.LG","submitted_at":"2024-09-28T06:20:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to 
prior systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.14294","ref_index":274,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Efficient Inference for Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-04-22T15:53:08+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"dress space of the KV cache and its corresponding physical address space. To enhance the efficiency of the attention operator, the loading pattern of the KV cache must be tailored to facilitate contiguous memory access. For instance, in the case of the PagedAttention by vLLM [51], the storage 26 Serving System Distributed Systems Splitwise [272], TetriInfer [273], DistServe [274], SpotServe [275], Infinite-LLM [276] Scheduling ORCA [277], vLLM [51], LightLLM [278], DeepSpeed-FastGen [279], FastServe [280], VTC [281] Batching ORCA [277], vLLM [51], Sarathi [282], DeepSpeed-FastGen [279], Sarathi-Serve [283], LightLLM [278] Memory Management S3 [284], vLLM [51], LightLLM [278], FlashInfer [285] Fig. 17. Taxonomy of the optimization for LLM serving system."}],"limit":50,"offset":0}