{"total":13,"items":[{"citing_arxiv_id":"2606.29708","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Demystifying the Design Space and Best Practices for Heterogeneous LLM Inference and Serving","primary_cat":"cs.DC","submitted_at":"2026-06-29T02:24:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper organizes heterogeneous prefill-decode LLM serving into a four-axis design space and identifies three recurring boundary decisions that require joint choices.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21847","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CompPow: A Case for Component-level GPU Power Management","primary_cat":"cs.AR","submitted_at":"2026-05-21T00:44:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"CompPow makes the case that component-aware power management inside GPUs can yield 10% higher energy efficiency and 5% better performance for ML workloads.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20315","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-19T17:50:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mix-Quant quantizes prefilling to NVFP4 and keeps BF16 for decoding in agentic LLMs, achieving up to 3x prefilling speedup while largely preserving task performance on long-context and agentic 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11999","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures","primary_cat":"cs.DC","submitted_at":"2026-05-12T11:48:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Every model is served via vLLM [8] in BF16, taken directly from HuggingFace with no custom kernels, the scenario most practitioners actually face. We run on a single NVIDIA H200 SXM (HBM3e, 4.8 TB/s bandwidth, 989 TFLOPS BF16 dense peak, 700 W TDP). This single-card setup directly mirrors the decode pool model widely adopted in industry: disaggregated serving systems [17, 25] route prefill and decode requests to dedicated GPU pools, so each decode-pool card sees a decode-only workload, exactly what we measure. 
Energy is measured via NVML power sampling at 50 ms intervals, integrated with the trapezoidal rule; for operations shorter than 100 ms ( ≈44% of prefill configs) we fall back to the product of snapshot power and wall-clock latency."},{"citing_arxiv_id":"2605.11333","ref_index":84,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces","primary_cat":"cs.DC","submitted_at":"2026-05-11T23:38:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Chakra introduces a standardized graph-based execution trace representation for distributed ML workloads along with supporting tools to enable benchmarking, analysis, generation, and co-design across simulators and hardware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11232","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack","primary_cat":"cs.AI","submitted_at":"2026-05-11T20:47:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Workload-aware optimizations for LLM serving in AML and fraud detection yield substantial gains in throughput, latency, and GPU utilization on synthetic compliance prompts.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tomatic Prefix Caching reuses KV-cache blocks for matching prefixes, while LMCache extends reuse across GPU, CPU, disk, and remote tiers [ 5, 6, 7]. 
Related systems also disaggregate prefill and decode: Splitwise and DistServe show that separating compute-heavy prompt processing from memory-heavy token generation can improve goodput and reduce head-of-line blocking [8, 9]. Structured generation further motivates workload-aware serving. SGLang introduces RadixAttention for KV reuse across structured language-model programs [10]. Speculative decoding accelerates generation by using draft tokens verified by the target model [11], with EAGLE-3 improving the approach through direct token prediction and multi-layer feature"},{"citing_arxiv_id":"2605.07985","ref_index":30,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation","primary_cat":"cs.DC","submitted_at":"2026-05-08T16:44:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dooly reduces LLM inference profiling GPU-hours by 56.4% across 12 models while keeping simulation MAPE under 5% for TTFT and 8% for TPOT by making profiling configuration-agnostic and redundancy-aware.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Since exhaustively running each configuration on a live system to measure performance is impractical, profile-based simulators [3, 5, 11, 12, 24, 17] have become the standard tool for estimating LLM performance before deployment. These simulators treat the GPU as a black box, collecting execution latency of operations and modules via PyTorch's Kineto profiler [30] which records GPU activity through NVIDIA's CUPTI [1]. We define an operation as a low-level CPU call that dispatches a GPU kernel (e.g., aten::linear), and a module as a coarser-grained PyTorch block composed of multiple operations (e.g., RowParallelLinear). 
Two limitations undermine the cross-configuration exploration these simulators are meant to enable."},{"citing_arxiv_id":"2605.01708","ref_index":21,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving","primary_cat":"cs.DC","submitted_at":"2026-05-03T04:22:51+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"exponent values with fixed 4-bit codes and stores rare values in a small escape buffer with their positions and raw exponents. This design enables GPU-friendly parallel decoding and reconstructs the original KV cache bit-exactly. such settings, KV Cache transfer must traverse slower and more constrained inter-node or inter-cluster links, making communication overhead a growing bottleneck [21]. This transfer bottleneck is especially pronounced for long-input workloads, such as document-level question answering, codebase understanding, and multi-document summarization, where prefill produces a large KV cache that must be transferred to the decode workers. 
Recent serving deployments increasingly separate prefill and decode across different hardware tiers or even different datacenters,"},{"citing_arxiv_id":"2604.16007","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs","primary_cat":"cs.AR","submitted_at":"2026-04-17T12:29:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Memory Type Latency Capacity Bandwidth Shoreline𝑝 bg (mW/GB)𝑒 read (pJ/bit)𝑒 write (pJ/bit) Note On-Chip Memory SRAM 1 Die∼1.5𝑛𝑠256 MB 4TB/s -∼10𝑘−50𝑘 ∗ ∼0.1 ∗ ∼0.1 ∗ Obtained by Experiments Off-Chip Memory HBM3E 8H 1 Stack∼100𝑛𝑠24 GB 1 TB/s∼11𝑚𝑚∼50−100 ∗ ∼3 ∗ ∼3.6 ∗ Obtained by Experiments HBM4 12H 1 Stack∼100𝑛𝑠36 GB 2 TB/s∼15𝑚𝑚∼50−100 † ∼2.2 † ∼2.4 † 40% energy efficiency than HBM3e [42]. 
LPDDR5X 1 Pkg[29]∼50𝑛𝑠16 GB 76.8 GB/s∼4.1𝑚𝑚∼7.65 ∗ ∼5 ∗ ∼6.5 ∗ Obtained by Experiments LPDDR6 1 Pkg[22]∼50𝑛𝑠16 GB 172.8 GB/s∼4.5𝑚𝑚∼6.12 † ∼3.75 † ∼4.87 † 20% to 30% more energy efficient than LPDDR5X [30] GDDR6 1 Chip[28]∼12𝑛𝑠2 GB 64 GB/s∼11𝑚𝑚∼100 ∗ ∼7 ∗ ∼8.8 ∗ Obtained by Experiments GDDR7 1 Chip∼12𝑛𝑠3 GB 128 GB/s∼11𝑚𝑚∼120 † ∼5.6 † ∼7.0 † 20% more energy efficient than"},{"citing_arxiv_id":"2605.05219","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sparse Prefix Caching for Hybrid and Recurrent LLM Serving","primary_cat":"cs.LG","submitted_at":"2026-04-17T09:24:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07760","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reduced-Mass Orbital AI Inference via Integrated Solar, Compute, and Radiator Panels","primary_cat":"cs.DC","submitted_at":"2026-04-09T03:28:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Integrated solar-compute-radiator panels enable orbital satellites to achieve over 100 kW of AI inference compute per metric ton launched, supporting thousands of simultaneous large language model sessions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.09999","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ServeGen: Workload Characterization and Generation of Large Language Model Serving in 
Production","primary_cat":"cs.DC","submitted_at":"2025-05-15T06:24:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ServeGen characterizes production LLM inference workloads across model types and generates realistic per-client composed workloads that reduce under-provisioning by 50% in a production validation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.14294","ref_index":272,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Efficient Inference for Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-04-22T15:53:08+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ation of the mapping relationship between the virtual address space of the KV cache and its corresponding physical address space. To enhance the efficiency of the attention operator, the loading pattern of the KV cache must be tailored to facilitate contiguous memory access. For instance, in the case of the PagedAttention by vLLM [51], the storage 26 Serving System Distributed Systems Splitwise [272], TetriInfer [273], DistServe [274], SpotServe [275], Infinite-LLM [276] Scheduling ORCA [277], vLLM [51], LightLLM [278], DeepSpeed-FastGen [279], FastServe [280], VTC [281] Batching ORCA [277], vLLM [51], Sarathi [282], DeepSpeed-FastGen [279], Sarathi-Serve [283], LightLLM [278] Memory Management S3 [284], vLLM [51], LightLLM [278], FlashInfer [285]"}],"limit":50,"offset":0}