LLM Inference Unveiled: Survey and Roofline Model Insights

22 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · method: 1

representative citing papers

OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

cs.LG · 2026-05-06 · unverdicted · novelty 6.0 · 2 refs

OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.

Gated Subspace Inference for Transformer Acceleration

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.

HybridFlow: A Flexible and Efficient RLHF Framework

cs.LG · 2024-09-28 · unverdicted · novelty 6.0

HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

EinSort: Sorting is All We Need for Tensorizing LLM

cs.LG · 2026-06-07 · unverdicted · novelty 5.0

Sorting tensor indices enables an adaptive tensorization method that discovers low-rank structure in LLM weights and KV caches, yielding better reconstruction quality than baselines.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
