OLLA: Optimizing the lifetime and location of arrays to reduce the memory usage of neural networks
2 Pith papers cite this work. Representative citing papers:
- Efficient Memory Management for Large Language Model Serving with PagedAttention: PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
- Co-Design of CNN Accelerators for TinyML using Approximate Matrix Decomposition: a co-design framework combining approximate matrix decomposition with genetic algorithms delivers a 33% average latency reduction in TinyML CNN FPGA accelerators, at a 1.3% average accuracy loss versus standard systolic arrays.
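The PagedAttention entry above attributes the near-zero KV-cache waste to paging: sequences receive fixed-size blocks on demand instead of one contiguous maximum-length reservation. A minimal sketch of that idea, assuming illustrative names throughout (this is not vLLM's actual API):

```python
# Illustrative sketch of block-based KV-cache allocation in the spirit of
# PagedAttention. A sequence maps logical blocks to physical blocks via a
# block table, allocating a new block only when the current one fills, so
# waste is bounded by one partially filled block per sequence.

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockAllocator:
    """Pool of free physical KV-cache blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)

class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only at block boundaries.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(20):
    seq.append_token()
# 20 tokens need ceil(20 / 16) = 2 blocks; the other 62 stay available
# for other sequences, instead of being reserved up to a max length.
```

The contrast with contiguous preallocation is the point: reserving a max-length buffer per request would pin all 64 blocks to a handful of sequences, while block-granular allocation leaves unused capacity in the shared pool.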