Towards efficient generative large language model serving: A survey from algorithms to systems

Miao, X · 2023 · arXiv 2312.15234

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

cs.LG · 2026-02-04 · unverdicted · novelty 5.0

BPDQ creates variable quantization grids from bit-planes and scalar coefficients, refined iteratively with second-order data to minimize output error, enabling 2-bit serving of Qwen2.5-72B on one RTX 3090 at 83.85% GSM8K accuracy.

ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators

cs.AR · 2025-12-10 · unverdicted · novelty 5.0

ODMA raises KV-cache utilization by up to 19.25% and throughput by 23-27% on Cambricon MLU accelerators by dynamically adjusting prediction buckets and using a safety pool for LLM serving.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

A Survey on the Memory Mechanism of Large Language Model based Agents

cs.AI · 2024-04-21 · accept · novelty 3.0

A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.

citing papers explorer

Showing 4 of 4 citing papers.

BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models cs.LG · 2026-02-04 · unverdicted · none · ref 16
BPDQ creates variable quantization grids from bit-planes and scalar coefficients, refined iteratively with second-order data to minimize output error, enabling 2-bit serving of Qwen2.5-72B on one RTX 3090 at 83.85% GSM8K accuracy.
ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators cs.AR · 2025-12-10 · unverdicted · none · ref 17
ODMA raises KV-cache utilization by up to 19.25% and throughput by 23-27% on Cambricon MLU accelerators by dynamically adjusting prediction buckets and using a safety pool for LLM serving.
A Survey on Efficient Inference for Large Language Models cs.CL · 2024-04-22 · accept · none · ref 22
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
A Survey on the Memory Mechanism of Large Language Model based Agents cs.AI · 2024-04-21 · accept · none · ref 28
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.

Towards efficient generative large language model serving: A survey from algorithms to systems

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer