SparQ Attention: Bandwidth-Efficient LLM Inference

2023 · arXiv 2312.04985

6 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

MultiPath Memory Access: Breaking Host-GPU Bandwidth Bottlenecks in LLM Services

cs.DC · 2025-12-18 · unverdicted · novelty 7.0

MMA routes host-GPU transfers over multiple available paths to deliver 4.62x higher peak bandwidth and lower latencies in LLM serving without hardware or driver changes.

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing pipelines.

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

cs.AR · 2026-04-27 · unverdicted · novelty 6.0

Salca is a new ASIC accelerator that achieves a 3.82× speedup and 74.19× better energy efficiency than an A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.

An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

cs.LG · 2026-05-08 · unverdicted · novelty 5.0

Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.

ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing

cs.LG · 2025-07-29 · unverdicted · novelty 5.0

ReasonCache reuses similar KV cache states across reasoning steps in LRMs via collaborative filtering to boost serving throughput by up to 89.2% while preserving accuracy.

Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM

cs.CL · 2025-05-09 · unverdicted · novelty 5.0

STARC remaps sparse KV caches by semantic clustering for PIM hardware, delivering 19-31% lower attention latency and 19-27% lower energy versus token-wise sparsity, with larger gains under tight KV budgets.

citing papers explorer

Showing 6 of 6 citing papers.

MultiPath Memory Access: Breaking Host-GPU Bandwidth Bottlenecks in LLM Services · cs.DC · 2025-12-18 · unverdicted · none · ref 35
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache · cs.LG · 2026-05-07 · unverdicted · none · ref 39
Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding · cs.AR · 2026-04-27 · unverdicted · none · ref 44
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference · cs.LG · 2026-05-08 · unverdicted · none · ref 38
ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing · cs.LG · 2025-07-29 · unverdicted · none · ref 21
Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM · cs.CL · 2025-05-09 · unverdicted · none · ref 33
