{"total":16,"items":[{"citing_arxiv_id":"2605.14217","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PreFT: Prefill-only finetuning for efficient inference","primary_cat":"cs.LG","submitted_at":"2026-05-14T00:19:41+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Prefill-only adaptation of LLMs yields 1.9x higher throughput for 512 adapters on Llama 3.1 70B with near-parity performance on RL tasks and recoverable loss on SFT.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03375","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving","primary_cat":"cs.OS","submitted_at":"2026-05-05T05:33:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Tutti is a GPU-direct SSD-backed KV cache that removes CPU bottlenecks via object abstraction, GPU io_uring, and slack scheduling, delivering near-DRAM performance at 2x higher request rate and 27% lower cost than prior GDS-based systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24820","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding","primary_cat":"cs.AR","submitted_at":"2026-04-27T14:06:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined 
hardware.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"from O(n^2) to a near-linear level without sacrificing accuracy[46] [66] [27]. This property is particularly important for LCS. It breaks the constraint of loading all K/V during each decoding step, directly mitigating severe compute and bandwidth bottlenecks imposed by growing sequence lengths[40]. Recently, various hardware-software co-designed accelerators have emerged to optimize sparse attention[39][58]. For example, SpAtten removes secondary information through cascaded pruning[52]; Energon introduces multi-round filtering and dynamic threshold[72]; ELSA proposes hash mapping approximation[15]. However, these works are mainly designed and evaluated for short context scenarios (SCS), typically no longer than 4K. Transitioning to long context"},{"citing_arxiv_id":"2604.26968","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference","primary_cat":"cs.AR","submitted_at":"2026-04-19T21:34:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A unified KV cache system with architecture-specific sizing, six-tier memory from GPU to filesystems, and Bayesian prediction delivers 7.4x higher batch sizes, 70-84% hit rates, and projected 1.7-2.9x throughput gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16007","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs","primary_cat":"cs.AR","submitted_at":"2026-04-17T12:29:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MemExplorer optimizes 
heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"current dLLM models [33], we adopt the GSM8K benchmark [57] as a representative workload. We use the average token usage (1.4K prompt, 0.2K generation) for evaluation. The search results are shown in Table 7. 5.4.2 Large MoE Model with Sparsity. To investigate memory hierarchy design for extremely large sparse Mixture-of-Experts (MoE) models, we include Qwen3.5-397B-A17B [45] as a representative case study. This model contains 397B total parameters, requiring approximately 370 GB of storage for weights alone, while activating 9 Table 6: Pareto frontier samples selected from DSE for both prefill and decode optimization on the OSWorld task (input tokens: 90K, output tokens: 8K). Prefill Optimization Compute Memory Hierarchy Software Power Perf"},{"citing_arxiv_id":"2604.03044","ref_index":97,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency","primary_cat":"cs.CL","submitted_at":"2026-04-03T13:52:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"such as TTFT and degrades user experience. To simulate this workload, we randomly generated data with no prefix reuse, featuring input lengths of 20,000 tokens and output lengths of 100 tokens. Request rates were varied from 0.25 to 3.0 requests per second (RPS). Experiments were conducted with a Prefill-Decode (PD) disaggregated deployment. 
And Mooncake [97] was employed as a centralized KV cache store to manage cache across requests. Also, we provide the following deployment insights under this scenario: • PD Disaggregation Delivers Better Flexibility: Compared with aggregated deployment, PD disaggregation supports independent scaling of prefill and decode, and centralized KV caching lets us tune their instance ratio"},{"citing_arxiv_id":"2603.21354","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project","primary_cat":"cs.LG","submitted_at":"2026-03-22T18:30:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1× better than dense Llama-3.1-70B). Energy optimization therefore requires routing-pool co-design; a homogeneous fleet cannot reach the energy frontier regardless of GPU generation. 6.3 Disaggregated Prefill/Decode and KV-Cache Topology Splitwise [23] and DistServe [44] physically separate prefill and decode, serving 4.48× more requests at equivalent SLO. Mooncake [64] provides KV-cache-centric disaggregation with hierarchical offloading (GPU HBM, CPU DRAM, SSD). NVIDIA Dynamo [65] enables dynamic scheduling with KV-cache offloading across memory hierarchies. 
Our inference-fleet-sim [14] models all three topologies (monolithic, two-pool-routed, disaggregated), revealing that disaggregation is not always beneficial: for concentrated-below workloads (Archetype 1)"},{"citing_arxiv_id":"2511.00413","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tree Training: Accelerating Agentic LLMs Training via Shared Prefix Reuse","primary_cat":"cs.LG","submitted_at":"2025-11-01T05:56:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tree Training serializes tree trajectories via DFS and uses redundancy-free partitioning to compute weighted per-token losses exactly once per token, achieving up to 6.2x training speedup on dense and MoE models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.18586","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TokenCake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications","primary_cat":"cs.DC","submitted_at":"2025-10-21T12:39:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TokenCake introduces agent-aware temporal and spatial schedulers for KV cache management in LLM multi-agent serving, claiming over 47% lower end-to-end latency and up to 16.9% better GPU memory utilization than vLLM on representative benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.15919","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast 
Scaling","primary_cat":"cs.DC","submitted_at":"2025-08-21T18:40:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HFX jointly designs scheduling and scaling for multi-SLO LLM serving, achieving up to 4.44x higher SLO attainment, 65.82% lower latency, and 49.81% lower cost than prior systems on multi-task workloads.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.23970","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving","primary_cat":"cs.DC","submitted_at":"2025-05-29T19:52:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GreenCache dynamically manages LLM KV cache resources to reduce carbon emissions by 15.1% on average (up to 25.3%) while meeting latency constraints for over 90% of requests on real traces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.18454","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving","primary_cat":"cs.AR","submitted_at":"2025-05-19T06:37:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sandwich delivers 2.01x average end-to-end speedup and up to 3.4x latency reduction for CPU LLM serving via phase-wise hot-switching, TopoTree hardware abstraction, and fast-start dynamic kernel generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.15965","ref_index":121,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Human Memory to 
AI Memory: A Survey on Memory Mechanisms in the Era of LLMs","primary_cat":"cs.IR","submitted_at":"2025-04-22T15:05:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.","context_count":1,"top_context_role":"method","top_context_polarity":"background","context_text":"Retroformer [103], Expel [104], Synapse [105], MetaGPT [106], Learned Memory Bank [107], M+ [108] VII System Parametric Short-Term KV Management LookupFFN [109], ChunkKV [110], vLLM [111], FastServe [112], StreamingLLM [113], Orca [114], DistServe [115], LLM.int8() [116], FastGen [117], Train Large, Then Compress [118], Scissorhands [119], H2O [120], Mooncake [121], MemServe [122], SLM Serving [123], IMPRESS [124], AdaServe [125], MPIC [126], IntelLLM [127] KV Reuse KV Cache [128], Prompt Cache [83], Contextual Retrieval [84], CacheGen [129], ChunkAttention [130], RAGCache [131], SGLang [132], Ada-KV [133], HCache [134], Cake [135], EPIC [136], RelayAttention [137], Marconi [138], IKS [139], FastCache [140], Cache-Craft [141], KVLink [142],"},{"citing_arxiv_id":"2502.10248","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model","primary_cat":"cs.CV","submitted_at":"2025-02-14T15:58:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal 
benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.03594","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching","primary_cat":"cs.CL","submitted_at":"2024-11-29T05:57:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BatchLLM achieves 1.3x-10.8x higher throughput than vLLM and SGLang for batched LLM inference with prefix sharing via global prefix identification, decoding-first reordering, and memory-centric token batching.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.10516","ref_index":105,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval","primary_cat":"cs.LG","submitted_at":"2024-09-16T17:59:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RetrievalAttention approximates full attention in long-context LLMs by retrieving relevant KV vectors from CPU-based ANNS indexes with an attention-aware algorithm, achieving near-full accuracy while accessing only 1-3% of the data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}