LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
5 Pith papers cite this work.
years: 2026 (5)
verdicts: UNVERDICTED (5)
citing papers explorer
-
LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding
-
When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation
Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.
-
HYPIC: Accelerating Hybrid-Attention LLM Serving with Position-Independent Caching
HYPIC enables position-independent KV caching for hybrid-attention models via segment-cumulative operators and boundary seam recomputation, delivering 2.45× average TTFT reduction and up to 2.0× throughput gain.
-
Latent Bridges for Multi-Table Question Answering
GRAB improves multi-table QA performance by encoding relational data as graphs and bridging structural signals to frozen LLMs through latent tokens.
-
CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference
CacheWeaver is a lightweight scheduling layer that orders evidence to exploit prefix caching, reducing median TTFT by 20-33% across vLLM setups while preserving answer quality.