Power capping is illusory for LLM decode: the memory-bound decode phase never approaches the 700 W limit of current GPUs, so a cap leaves the power headroom untouched and saves nothing. Locking the SM clock instead saves up to 32% energy, and three distinct DVFS behavior classes emerge across attention types.
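For readers who want to try the clock knob, here is a minimal sketch using NVML via the pynvml bindings. The 1005 MHz value and the `run_decode_workload()` helper are illustrative placeholders, not settings from the paper; locking clocks normally requires administrator privileges, and the CLI equivalent is `nvidia-smi -lgc 1005,1005`.

```python
# Minimal sketch: lock SM clocks instead of capping power. During memory-bound
# decode the GPU never approaches its power limit, so a cap does not bind,
# while a lower locked clock directly reduces energy per token.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

# 1005 MHz is an illustrative value; the best clock depends on the GPU and on
# which DVFS behavior class the attention workload falls into.
pynvml.nvmlDeviceSetGpuLockedClocks(handle, 1005, 1005)
try:
    run_decode_workload()  # hypothetical placeholder for the serving loop
finally:
    pynvml.nvmlDeviceResetGpuLockedClocks(handle)  # restore default DVFS
    pynvml.nvmlShutdown()
```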
Eight papers indexed by Pith cite this work, all from 2026; citation-polarity classification is still pending.

Representative citing papers:
Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.
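A hedged sketch of how taint-keyed reuse could work; the `ProfileDB` class, the taint sets, and all numbers below are my illustrative assumptions, not Dooly's actual design. The idea: key each profile entry only by the configuration parameters that an operator's latency is actually tainted by, so measurements are reused across configurations that differ only in irrelevant parameters.

```python
# Illustrative taint-keyed profiling cache (not Dooly's real data model).
from typing import Dict, FrozenSet, Tuple

class ProfileDB:
    def __init__(self) -> None:
        # key = (operator name, values of the tainted config params only)
        self._db: Dict[Tuple[str, Tuple], float] = {}

    def _key(self, op: str, taint: FrozenSet[str], config: Dict) -> Tuple[str, Tuple]:
        # Only parameters the op's latency depends on enter the key, so
        # configs differing in untainted parameters share one entry.
        return op, tuple(sorted((name, config[name]) for name in taint))

    def lookup(self, op, taint, config):
        return self._db.get(self._key(op, taint, config))

    def record(self, op, taint, config, latency_ms):
        self._db[self._key(op, taint, config)] = latency_ms

db = ProfileDB()
# In Dooly the taint set would come from the taint analysis itself; here it
# is given by hand: decode attention latency depends on batch size and KV
# length, not on the scheduler policy (illustrative).
taint = frozenset({"batch", "kv_len"})
db.record("decode_attn", taint, {"batch": 8, "kv_len": 4096, "policy": "fcfs"}, 1.7)
# A config differing only in an untainted parameter reuses the measurement:
assert db.lookup("decode_attn", taint, {"batch": 8, "kv_len": 4096, "policy": "priority"}) == 1.7
```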
SplitZip is a new GPU-friendly lossless compressor for KV cache tensors that exploits exponent redundancy to achieve over 600 GB/s compression throughput and up to 1.32x faster transfers in disaggregated LLM serving.
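The exponent-redundancy idea is easy to demonstrate on the CPU: fp16 KV-cache values occupy a narrow dynamic range, so their 5-bit exponents repeat heavily and compress far better once split from the sign/mantissa bits. The sketch below, using numpy and zlib, only illustrates the transform; SplitZip itself runs on the GPU with its own coder, and the tensor size and scale here are stand-ins.

```python
import numpy as np
import zlib

kv = (np.random.randn(4096) * 0.05).astype(np.float16)  # stand-in KV tensor
bits = kv.view(np.uint16)                   # fp16: 1 sign, 5 exp, 10 mantissa

exponents = ((bits >> 10) & 0x1F).astype(np.uint8)  # 5-bit exponent plane
rest = bits & 0x83FF                                # sign + mantissa bits

split = len(zlib.compress(exponents.tobytes())) + len(zlib.compress(rest.tobytes()))
baseline = len(zlib.compress(bits.tobytes()))
print(f"interleaved: {baseline} B, split planes: {split} B")

# The transform is lossless: recombining the planes is bit-exact.
restored = (rest | (exponents.astype(np.uint16) << 10)).view(np.float16)
assert np.array_equal(restored, kv)
```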
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving places state checkpoints via dynamic programming, choosing positions that are optimal under the observed prefix-overlap distribution, and improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
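One plausible dynamic-programming formulation, stated here as my own assumption rather than the paper's exact model: given `p[l]`, the probability that a future request overlaps the cached prefix by exactly `l` tokens, and a budget of `K` state checkpoints, a request with overlap `l` resumes from the deepest checkpoint at position `<= l`, and we place checkpoints to maximize the expected number of recomputed tokens saved, in O(K·N²) time.

```python
from functools import lru_cache

def place_checkpoints(p, K):
    """p[l] = P(overlap == l) for l in 0..N; returns (expected tokens saved,
    checkpoint positions)."""
    N = len(p) - 1
    suffix = [0.0] * (N + 2)              # suffix[i] = P(overlap >= i)
    for i in range(N, -1, -1):
        suffix[i] = suffix[i + 1] + p[i]

    @lru_cache(maxsize=None)
    def best(k, i):
        # Best expected savings with the k-th checkpoint at position i. A
        # checkpoint at i serves every overlap in [i, next checkpoint),
        # saving i tokens of prefix recompute each time it is hit.
        if k == K:                        # last checkpoint covers [i, N]
            return i * suffix[i], ()
        options = [(i * suffix[i], ())]   # also allow using fewer than K
        for j in range(i + 1, N + 1):
            sub, rest = best(k + 1, j)
            options.append((i * (suffix[i] - suffix[j]) + sub, (j,) + rest))
        return max(options)

    candidates = []
    for i in range(1, N + 1):
        val, rest = best(1, i)
        candidates.append((val, (i,) + rest))
    return max(candidates)

# Overlap mass concentrated near lengths 3 and 8 of a 10-token prefix:
p = [0, 0, 0.05, 0.3, 0.1, 0, 0, 0.05, 0.4, 0.05, 0.05]
print(place_checkpoints(p, K=2))          # -> (~5.35, (3, 8))
```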
Chakra introduces a portable, interoperable graph-based execution trace format for distributed ML workloads along with supporting tools to standardize performance benchmarking and software-hardware co-design.
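The core idea of a graph-based trace can be shown with a toy schema. This is only an illustration of the concept, not Chakra's actual format (Chakra defines a richer, protobuf-based schema and conversion tools): compute and communication operators become nodes, dependencies become explicit edges, and the serialized graph can be consumed by any simulator or replay tool that understands it.

```python
# Toy graph-based execution trace (illustrative schema, not Chakra's).
from dataclasses import dataclass, field
from typing import List
import json

@dataclass
class TraceNode:
    id: int
    name: str
    kind: str              # e.g. "compute" or "communication"
    duration_us: float
    deps: List[int] = field(default_factory=list)  # node ids this one waits on

trace = [
    TraceNode(0, "embedding_fwd", "compute", 120.0),
    TraceNode(1, "allreduce_grad", "communication", 850.0, deps=[0]),
    TraceNode(2, "optimizer_step", "compute", 300.0, deps=[1]),
]

# Any tool that understands the schema can replay or simulate the same trace,
# which is the interoperability point of a standardized format.
print(json.dumps([vars(n) for n in trace], indent=2))
```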
MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.
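A hedged sketch of the kind of search the name suggests: enumerate heterogeneous memory configurations and keep the most energy-efficient one that fits the power budget. The tiers, the bandwidth-bound cost model, and every number below are illustrative assumptions of mine, not MemExplorer's.

```python
# Illustrative design-space exploration over memory tiers (not MemExplorer's model).
from itertools import product

TIERS = [("HBM", 800, 18.0), ("LPDDR", 120, 3.5), ("Flash", 8, 1.2)]  # (name, GB/s, W)
POWER_BUDGET_W = 30.0
GB_PER_TOKEN = 0.5        # assumed memory traffic per decoded token

def efficiency(counts):
    """tokens/joule for a config with counts[i] units of TIERS[i], or None."""
    bw = sum(c * t[1] for c, t in zip(counts, TIERS))
    power = sum(c * t[2] for c, t in zip(counts, TIERS))
    if power == 0 or power > POWER_BUDGET_W:
        return None                        # empty config or over budget
    tokens_per_s = bw / GB_PER_TOKEN       # assume bandwidth-bound decode
    return tokens_per_s / power, counts

# 0..2 units of each tier -> 27 candidate configurations.
best = max(filter(None, (efficiency(c) for c in product(range(3), repeat=3))))
print(best)   # (tokens per joule, (n_HBM, n_LPDDR, n_Flash))
```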
Integrated solar-compute-radiator panels enable orbital satellites to achieve over 100 kW of AI inference compute per metric ton launched, supporting thousands of simultaneous large language model sessions.
Workload-aware optimizations for LLM serving in anti-money-laundering (AML) and fraud detection yield substantial throughput and GPU-utilization gains and latency reductions on synthetic compliance prompts.