BPDQ creates variable quantization grids from bit-planes and scalar coefficients, refined iteratively with second-order data to minimize output error, enabling 2-bit serving of Qwen2.5-72B on one RTX 3090 at 83.85% GSM8K accuracy.
Towards efficient generative large language model serving: A survey from algorithms to systems
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
ODMA raises KV-cache utilization by up to 19.25% and throughput by 23-27% on Cambricon MLU accelerators by dynamically adjusting prediction buckets and using a safety pool for LLM serving.
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
citing papers explorer
-
BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models
BPDQ creates variable quantization grids from bit-planes and scalar coefficients, refined iteratively with second-order data to minimize output error, enabling 2-bit serving of Qwen2.5-72B on one RTX 3090 at 83.85% GSM8K accuracy.
-
ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators
ODMA raises KV-cache utilization by up to 19.25% and throughput by 23-27% on Cambricon MLU accelerators by dynamically adjusting prediction buckets and using a safety pool for LLM serving.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
-
A Survey on the Memory Mechanism of Large Language Model based Agents
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.