and Zhang, Hao and Stoica, Ion , booktitle =

Kwon, Woosuk, Li, Zhuohan, Zhuang, Siyuan, Sheng, Ying, Zheng, Lianmin, Yu, Cody Hao · 2023

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

browse 7 citing papers

citation-role summary

background 2 method 1

citation-polarity summary

background 2 use method 1

representative citing papers

Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Memory Inception is a training-free method that injects latent KV banks at chosen layers to steer LLMs, achieving superior control-drift balance and up to 118x storage reduction on personality and structured-reasoning tasks.

Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faster inference with comparable quality.

Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Feather uses reinforcement learning and a Chunked Hash Tree to balance batch size against prefix homogeneity in LLM inference, delivering 2-10x higher throughput than existing schedulers.

SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs

cs.SE · 2026-05-06 · unverdicted · novelty 6.0

SynConfRoute routes code completions using syntax validation and token confidence, improving pass@1 by up to 31% on hard tasks and reducing accelerator usage by 58% versus always using the largest model.

One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

cs.DC · 2026-05-06 · unverdicted · novelty 6.0

HELM adaptively partitions HBM between EMB and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads.

Parallel Prefix Verification for Speculative Generation

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

PARSE accelerates LLM inference via parallel semantic prefix verification in a single forward pass, delivering 1.25x-4.3x speedups alone and up to 4.5x when combined with EAGLE-3.

AAFLOW: Scalable Patterns for Agentic AI Workflows

cs.DC · 2026-05-04 · unverdicted · novelty 5.0

AAFLOW is a unified distributed runtime that models agentic workflows as operators with a zero-copy data plane using Apache Arrow and Cylon, achieving up to 4.64x pipeline speedup through improved data flow and batching.

citing papers explorer

Showing 2 of 2 citing papers after filters.

One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving cs.DC · 2026-05-06 · unverdicted · none · ref 25
HELM adaptively partitions HBM between EMB and KV caches via a three-layer PPO controller and EMB-KV-aware scheduling, reducing P99 latency by 24-38% while achieving 93.5-99.6% SLO satisfaction on production workloads.
AAFLOW: Scalable Patterns for Agentic AI Workflows cs.DC · 2026-05-04 · unverdicted · none · ref 61
AAFLOW is a unified distributed runtime that models agentic workflows as operators with a zero-copy data plane using Apache Arrow and Cylon, achieving up to 4.64x pipeline speedup through improved data flow and batching.

and Zhang, Hao and Stoica, Ion , booktitle =

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer