{"total":10,"items":[{"citing_arxiv_id":"2605.22733","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools","primary_cat":"cs.AI","submitted_at":"2026-05-21T17:03:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HarnessAPI derives streaming HTTP endpoints, OpenAPI UI, and MCP tools from a single handler.py plus Pydantic schemas, cutting framework boilerplate by 74%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17410","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design","primary_cat":"cs.AI","submitted_at":"2026-05-17T12:11:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper defines Computational Token Economics and introduces the Token Economics Trilemma as a framework for trade-offs in granularity, real-time performance, and optimality, while outlining a research agenda for three challenge areas.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07985","ref_index":4,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation","primary_cat":"cs.DC","submitted_at":"2026-05-08T16:44:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dooly reduces LLM inference profiling GPU-hours by 56.4% across 12 models while keeping simulation MAPE under 5% for TTFT and 8% for TPOT by making profiling configuration-agnostic and 
redundancy-aware.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Two limitations undermine the cross-configuration exploration these simulators are meant to enable. Static configuration dependencies. Today's simulators fix the system and model configuration axes by hardcoding both the operation set to profile and the arguments used to invoke each operation, tying themselves to a specific software stack. Vidur [3] and Revati [5], for example, are tied to Sarathi-Serve [4] and FlashInfer [39], while LLMServingSim [11, 12] is tied to HuggingFace or vLLM. Each relies on a per-family model template that maps parameters such as head_dim and num_heads onto its predefined operator set. Consequently, the simulators cover only a single column of the configuration matrix (Figure 1), and any software stack update (e.g., a changed module"},{"citing_arxiv_id":"2605.01214","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic AI Systems Should Be Designed as Marginal Token Allocators","primary_cat":"cs.AI","submitted_at":"2026-05-02T03:06:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Agentic AI systems should be designed as marginal token allocators that balance benefit against cost, latency, and risk across their layers rather than as unit-priced text generators.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24203","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic Witnessing: Pragmatic and Scalable TEE-Enabled Privacy-Preserving Auditing","primary_cat":"cs.CR","submitted_at":"2026-04-27T09:07:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Agentic Witnessing enables privacy-preserving auditing of semantic 
properties in private data by running an LLM auditor in a TEE that answers binary queries and produces cryptographic transcripts of its reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"FwdLLM 2024 USENIX ATC [34] 25.5Partial 2 YesNo 8 Prod No 3537 34 502 98.1% NanoFlow 2025 OSDI [44] 21.4 Yes Yes Yes Proto No 4687 39 757 98.9% PowerInfer 2024 SOSP [27] 24.7 Yes Yes Yes Proto No 2691 30 450 98.1% Puffer 2019 NSDI [35] 5.5 Yes Yes Yes Proto No 2243 31 479 98.3% PyRCA 2023 USENIX ATC [23] 3.2 Yes Yes Yes Proto No 1648 27 262 97.1% Sarathi-Serve 2024 OSDI [1] 21.2 YesNo 4 Yes Proto Yes 4543 54 912 99.0% ServerlessLLM 2024 OSDI [10] 21.4 Yes Yes Yes Prod Yes 2319 31 420 98.0% SLOG 2019 PVLDB [26] 21.6 Yes Yes Yes Proto No 3709 42 636 98.5% StreamBox 2024 USENIX ATC [33] 22.1No 3 NoYes Proto No 5039 38 1046 98.9% VeriSMo 2024 OSDI [43] 20.6 Yes Yes Yes Proto Yes 4686 52 957 98.9% Verus 2024 SOSP [22] 21."},{"citing_arxiv_id":"2604.17353","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling","primary_cat":"cs.AI","submitted_at":"2026-04-19T09:59:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and 33%-51% lower hotspot miss rates.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"selected or sampled according to the decoding strategy. Dur- ing decoding, the system reuses the cached keys and values from previous steps, thereby avoiding recomputation over the full prefix and significantly improving serving efficiency. 
This split execution model makes KV cache management, scheduling, and resource allocation central to the design of high-performance LLM inference systems [1, 28, 41]. 2.2 Test-Time Scaling While training-time scaling has been the primary driver of recent advances in LLM capability [10, 15], its benefits have gradually slowed due to the high cost of pretraining and the limited availability of high-quality data. This trend has motivated the development of test-time scaling [39], which aims to improve model performance by allocating additional"},{"citing_arxiv_id":"2601.14910","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction","primary_cat":"cs.PF","submitted_at":"2026-01-21T11:47:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PipeWeave predicts GPU kernel performance with 6.1% average error and end-to-end inference with 8.5% error by feeding analytical pipeline features into ML, cutting prior method errors by 4-7x across 11 GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.09427","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators","primary_cat":"cs.AR","submitted_at":"2025-12-10T08:52:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ODMA raises KV-cache utilization by up to 19.25% and throughput by 23-27% on Cambricon MLU accelerators by dynamically adjusting prediction buckets and using a safety pool for LLM 
serving.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.09999","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production","primary_cat":"cs.DC","submitted_at":"2025-05-15T06:24:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ServeGen characterizes production LLM inference workloads across model types and generates realistic per-client composed workloads that reduce under-provisioning by 50% in a production validation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.14294","ref_index":283,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Efficient Inference for Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-04-22T15:53:08+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"in the case of the PagedAttention by vLLM [51], the storage 26 Serving System Distributed Systems Splitwise [272], TetriInfer [273], Dist- Serve [274], SpotServe [275], Infinite-LLM [276] Scheduling ORCA [277], vLLM [51], LightLLM [278], DeepSpeed-FastGen [279], FastServe [280], VTC [281] Batching ORCA [277], vLLM [51], Sarathi [282], DeepSpeed-FastGen [279], Sarathi-Serve [283], LightLLM [278] Memory Management S3 [284], vLLM [51], LightLLM [278], FlashIn- fer [285] Fig. 17. 
Taxonomy of the optimization for LLM serving system. of the head size dimension is structured as a 16-byte contiguous vector for K cache, while FlashInfer [285] orchestrates diverse data layouts for the KV cache, accompanied by an appropriately designed memory access scheme."}],"limit":50,"offset":0}