Title resolution pending

Pol G · 2025 · arXiv 2503.08311

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective

cs.LG · 2026-04-28 · unverdicted · novelty 7.0

KV cache eviction is unified under an information capacity maximization principle derived from a linear-Gaussian attention surrogate, with CapKV proposed as a leverage-score based implementation that outperforms prior heuristics in experiments.

Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures

cs.DC · 2026-04-10 · unverdicted · novelty 6.0

Watt Counts supplies over 5,000 energy measurements across 50 LLMs and 10 GPUs and shows that hardware-aware selection can reduce server-scenario energy use by up to 70 percent with little effect on user experience.

Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective

cs.AI · 2025-11-01 · conditional · novelty 5.0

The paper analyzes CPU bottlenecks in agentic AI serving, selects representative workloads, and demonstrates that CPU-aware scheduling optimizations COMB and MAS can reduce P50 latency by up to 1.7x and total latency by up to 2.49x on two hardware systems.

Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

cs.LG · 2025-09-25 · unverdicted · novelty 5.0

Learning-augmented LRU achieves 1-consistency and O(k)-robustness for GPU caching with low overhead, implemented in LCR to cut P99 TTFT by up to 28.3% on LLM workloads and raise throughput by up to 24.2% on DLRM workloads.

citing papers explorer

Showing 4 of 4 citing papers.

Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective cs.LG · 2026-04-28 · unverdicted · none · ref 23
KV cache eviction is unified under an information capacity maximization principle derived from a linear-Gaussian attention surrogate, with CapKV proposed as a leverage-score based implementation that outperforms prior heuristics in experiments.
Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures cs.DC · 2026-04-10 · unverdicted · none · ref 43
Watt Counts supplies over 5,000 energy measurements across 50 LLMs and 10 GPUs and shows that hardware-aware selection can reduce server-scenario energy use by up to 70 percent with little effect on user experience.
Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective cs.AI · 2025-11-01 · conditional · none · ref 31
The paper analyzes CPU bottlenecks in agentic AI serving, selects representative workloads, and demonstrates that CPU-aware scheduling optimizations COMB and MAS can reduce P50 latency by up to 1.7x and total latency by up to 2.49x on two hardware systems.
Toward Robust and Efficient ML-Based GPU Caching for Modern Inference cs.LG · 2025-09-25 · unverdicted · none · ref 27
Learning-augmented LRU achieves 1-consistency and O(k)-robustness for GPU caching with low overhead, implemented in LCR to cut P99 TTFT by up to 28.3% on LLM workloads and raise throughput by up to 24.2% on DLRM workloads.

Title resolution pending

fields

years

verdicts

representative citing papers

citing papers explorer