OLLA: Optimizing the lifetime and location of arrays to reduce the memory usage of neural networks
2 Pith papers cite this work. Representative citing papers:
- Efficient Memory Management for Large Language Model Serving with PagedAttention: PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
- Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition: a co-design framework combining approximate matrix decomposition with genetic algorithms delivers a 33% average latency reduction in TinyML CNN FPGA accelerators, at a 1.3% average accuracy loss versus standard systolic arrays.
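The PagedAttention entry above attributes the near-zero KV-cache waste to paging: sequences receive fixed-size blocks on demand instead of one contiguous maximum-length reservation. A minimal sketch of that idea, assuming illustrative names throughout (this is not vLLM's actual API):

```python
# Illustrative sketch of block-based KV-cache allocation in the spirit of
# PagedAttention. A sequence maps logical blocks to physical blocks via a
# block table, allocating a new block only when the current one fills, so
# waste is bounded by one partially filled block per sequence.

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockAllocator:
    """Pool of free physical KV-cache blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)

class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only at block boundaries.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(20):
    seq.append_token()
# 20 tokens need ceil(20 / 16) = 2 blocks; the other 62 stay available
# for other sequences, instead of being reserved up to a max length.
```

The contrast with contiguous preallocation is the point: reserving a max-length buffer per request would pin all 64 blocks to a handful of sequences, while block-granular allocation leaves unused capacity in the shared pool.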