LLM Inference Unveiled: Survey and Roofline Model Insights

Yuan, Z · 2024 · arXiv 2402.16363

23 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · method: 1

citation-polarity summary

background: 2 · use method: 1

representative citing papers

Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture

cs.AI · 2026-05-29 · unverdicted · novelty 7.0

Proposes the Intelligent Computing Architecture (ICA) as a six-layer framework with dual probabilistic-deterministic planes and three Amdahl-style heuristics to unify design of LLM-based systems.

Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics

cs.DC · 2026-04-08 · unverdicted · novelty 7.0

Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.

Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

cs.LG · 2026-04-05 · unverdicted · novelty 7.0

Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.

ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

cs.AR · 2026-03-28 · unverdicted · novelty 7.0

ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.

Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads

cs.LG · 2026-01-29 · unverdicted · novelty 7.0

A renewal-reward analysis yields a closed-form mean-field rule for the optimal Attention/FFN provisioning ratio in disaggregated LLM serving that accounts for stochastic KV-cache growth and matches simulation optima within 10%.

Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding

cs.CL · 2026-06-23 · unverdicted · novelty 6.0

Dustin reports 27.85x self-attention and 9.17x end-to-end speedups at 32k length on Qwen2.5-72B using draft-augmented sparse verification with negligible accuracy loss on PG-19 and LongBench.

KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

KVDrive introduces a multi-tier KV cache management system that achieves up to 1.74x higher throughput for long-context LLM inference through adaptive cache placement, pipeline restructuring, and cross-tier coordination while preserving accuracy.

OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

cs.LG · 2026-05-06 · unverdicted · novelty 6.0 · 2 refs

OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.

Gated Subspace Inference for Transformer Acceleration

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.

Kara: Efficient Reasoning LLM Serving via Sliding-Window KV Cache Compression

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

Kara proposes a decoding-time sliding-window KV cache compression technique with bidirectional attention scoring and Token2Chunk for flexible chunk preservation, implemented in KvLLM on vLLM to improve throughput for reasoning LLMs.

Sim-FA: A GPGPU Simulator Framework for Fine-Grained FlashAttention Pipeline Analysis

cs.AR · 2026-05-01 · unverdicted · novelty 6.0

Sim-FA is a new simulator that instruments FlashAttention-3 for cycle-accurate GPGPU analysis, achieving 5.7% average error on H800 while explaining inaccuracies in existing DRAM traffic models.

Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.

SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference

cs.AI · 2026-02-05 · unverdicted · novelty 6.0

SweetSpot is an analytical model from Transformer computational and memory complexity that identifies energy minima at short-to-moderate inputs and medium outputs, achieving 1.79% MAPE on H100 GPU measurements across multiple LLMs.

LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

cs.LG · 2025-03-25 · unverdicted · novelty 6.0

LogQuant applies log-based filtering for 2-bit KV cache quantization in LLMs, claiming 25% higher throughput, 60% larger batches, and 40-200% accuracy gains on math/code tasks versus existing compression approaches.

HybridFlow: A Flexible and Efficient RLHF Framework

cs.LG · 2024-09-28 · unverdicted · novelty 6.0

HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

cs.CL · 2023-12-10 · unverdicted · novelty 6.0

ASVD compresses LLMs by 10-30% and KV caches by 50% via activation-aware SVD that absorbs outliers into transformed weights and calibrates per-layer sensitivity.

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

cs.LG · 2026-06-17 · unverdicted · novelty 5.0

EfficientRollout applies self-speculative decoding with quantized drafter induction and system-aware acceptance policies to cut RL rollout latency up to 19.6% while preserving final model quality.

EinSort: Sorting is All We Need for Tensorizing LLM

cs.LG · 2026-06-07 · unverdicted · novelty 5.0

Sorting tensor indices enables an adaptive tensorization method that discovers low-rank structure in LLM weights and KV caches, yielding better reconstruction quality than baselines.

A Task Decomposition and Planning Framework for Efficient LLM Inference in AI-Enabled WiFi-Offload Networks

cs.DC · 2026-04-23 · unverdicted · novelty 4.0

An LLM planner for task decomposition and a decomposition-aware scheduler in multi-user WiFi networks reduce average latency by 20% and improve overall reward by 80% versus local-only and nearest-edge baselines.

Operator Fusion for LLM Inference on the Tensix Architecture

cs.LG · 2026-06-03 · unverdicted · novelty 3.0

Fusing RMSNorm with matmul in attention and FFN on Tensix reduces attention latency up to 37.44% and MLP up to 15.89% with PCC above 98.75% on Qwen models.

On-Device Generative AI for GDPR-Compliant Visual Monitoring: Natural Language Alerts from Local Object Detection

cs.CV · 2026-05-28 · unverdicted · novelty 3.0

A local pipeline on Raspberry Pi 5 with YOLOv5n-seg and Phi-3 Mini produces text alerts from on-device detection while keeping all image data private to meet GDPR requirements.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

citing papers explorer

Showing 23 of 23 citing papers.

Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture cs.AI · 2026-05-29 · unverdicted · none · ref 170
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding cs.CV · 2026-04-14 · unverdicted · none · ref 49
Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics cs.DC · 2026-04-08 · unverdicted · none · ref 37
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling cs.LG · 2026-04-05 · unverdicted · none · ref 12
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs cs.AR · 2026-03-28 · unverdicted · none · ref 60
Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads cs.LG · 2026-01-29 · unverdicted · none · ref 10
Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding cs.CL · 2026-06-23 · unverdicted · none · ref 16
KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference cs.CL · 2026-05-18 · unverdicted · none · ref 41
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization cs.LG · 2026-05-06 · unverdicted · none · ref 18 · 2 links
Gated Subspace Inference for Transformer Acceleration cs.LG · 2026-05-04 · unverdicted · none · ref 33
Kara: Efficient Reasoning LLM Serving via Sliding-Window KV Cache Compression cs.CL · 2026-05-01 · unverdicted · none · ref 5
Sim-FA: A GPGPU Simulator Framework for Fine-Grained FlashAttention Pipeline Analysis cs.AR · 2026-05-01 · unverdicted · none · ref 13
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding cs.CL · 2026-05-01 · unverdicted · none · ref 42
SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference cs.AI · 2026-02-05 · unverdicted · none · ref 28
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation cs.LG · 2025-03-25 · unverdicted · none · ref 18
HybridFlow: A Flexible and Efficient RLHF Framework cs.LG · 2024-09-28 · unverdicted · none · ref 96
ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models cs.CL · 2023-12-10 · unverdicted · none · ref 25
EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts cs.LG · 2026-06-17 · unverdicted · none · ref 65
EinSort: Sorting is All We Need for Tensorizing LLM cs.LG · 2026-06-07 · unverdicted · none · ref 94
A Task Decomposition and Planning Framework for Efficient LLM Inference in AI-Enabled WiFi-Offload Networks cs.DC · 2026-04-23 · unverdicted · none · ref 11
Operator Fusion for LLM Inference on the Tensix Architecture cs.LG · 2026-06-03 · unverdicted · none · ref 1
On-Device Generative AI for GDPR-Compliant Visual Monitoring: Natural Language Alerts from Local Object Detection cs.CV · 2026-05-28 · unverdicted · none · ref 17
A Survey on Efficient Inference for Large Language Models cs.CL · 2024-04-22 · accept · none · ref 27
