Proposes the Intelligent Computing Architecture (ICA) as a six-layer framework with dual probabilistic-deterministic planes and three Amdahl-style heuristics to unify design of LLM-based systems.
hub
Llm inference unveiled: Survey and roofline model insights
22 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.
Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
A renewal-reward analysis yields a closed-form mean-field rule for the optimal Attention/FFN provisioning ratio in disaggregated LLM serving that accounts for stochastic KV-cache growth and matches simulation optima within 10%.
KVDrive introduces a multi-tier KV cache management system that achieves up to 1.74x higher throughput for long-context LLM inference through adaptive cache placement, pipeline restructuring, and cross-tier coordination while preserving accuracy.
OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.
Kara proposes a decoding-time sliding-window KV cache compression technique with bidirectional attention scoring and Token2Chunk for flexible chunk preservation, implemented in KvLLM on vLLM to improve throughput for reasoning LLMs.
Sim-FA is a new simulator that instruments FlashAttention-3 for cycle-accurate GPGPU analysis, achieving 5.7% average error on H800 while explaining inaccuracies in existing DRAM traffic models.
EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
SweetSpot is an analytical model from Transformer computational and memory complexity that identifies energy minima at short-to-moderate inputs and medium outputs, achieving 1.79% MAPE on H100 GPU measurements across multiple LLMs.
LogQuant applies log-based filtering for 2-bit KV cache quantization in LLMs, claiming 25% higher throughput, 60% larger batches, and 40-200% accuracy gains on math/code tasks versus existing compression approaches.
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.
ASVD compresses LLMs by 10-30% and KV caches by 50% via activation-aware SVD that absorbs outliers into transformed weights and calibrates per-layer sensitivity.
EfficientRollout applies self-speculative decoding with quantized drafter induction and system-aware acceptance policies to cut RL rollout latency up to 19.6% while preserving final model quality.
Sorting tensor indices enables an adaptive tensorization method that discovers low-rank structure in LLM weights and KV caches, yielding better reconstruction quality than baselines.
An LLM planner for task decomposition and a decomposition-aware scheduler in multi-user WiFi networks reduce average latency by 20% and improve overall reward by 80% versus local-only and nearest-edge baselines.
Fusing RMSNorm with matmul in attention and FFN on Tensix reduces attention latency up to 37.44% and MLP up to 15.89% with PCC above 98.75% on Qwen models.
A local pipeline on Raspberry Pi 5 with YOLOv5n-seg and Phi-3 Mini produces text alerts from on-device detection while keeping all image data private to meet GDPR requirements.
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
citing papers explorer
-
Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture
Proposes the Intelligent Computing Architecture (ICA) as a six-layer framework with dual probabilistic-deterministic planes and three Amdahl-style heuristics to unify design of LLM-based systems.
-
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
-
Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.
-
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling
Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.
-
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
-
Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads
A renewal-reward analysis yields a closed-form mean-field rule for the optimal Attention/FFN provisioning ratio in disaggregated LLM serving that accounts for stochastic KV-cache growth and matches simulation optima within 10%.
-
KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference
KVDrive introduces a multi-tier KV cache management system that achieves up to 1.74x higher throughput for long-context LLM inference through adaptive cache placement, pipeline restructuring, and cross-tier coordination while preserving accuracy.
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
-
Gated Subspace Inference for Transformer Acceleration
Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.
-
Kara: Efficient Reasoning LLM Serving via Sliding-Window KV Cache Compression
Kara proposes a decoding-time sliding-window KV cache compression technique with bidirectional attention scoring and Token2Chunk for flexible chunk preservation, implemented in KvLLM on vLLM to improve throughput for reasoning LLMs.
-
Sim-FA: A GPGPU Simulator Framework for Fine-Grained FlashAttention Pipeline Analysis
Sim-FA is a new simulator that instruments FlashAttention-3 for cycle-accurate GPGPU analysis, achieving 5.7% average error on H800 while explaining inaccuracies in existing DRAM traffic models.
-
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding
EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
-
SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference
SweetSpot is an analytical model from Transformer computational and memory complexity that identifies energy minima at short-to-moderate inputs and medium outputs, achieving 1.79% MAPE on H100 GPU measurements across multiple LLMs.
-
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
LogQuant applies log-based filtering for 2-bit KV cache quantization in LLMs, claiming 25% higher throughput, 60% larger batches, and 40-200% accuracy gains on math/code tasks versus existing compression approaches.
-
HybridFlow: A Flexible and Efficient RLHF Framework
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.
-
ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models
ASVD compresses LLMs by 10-30% and KV caches by 50% via activation-aware SVD that absorbs outliers into transformed weights and calibrates per-layer sensitivity.
-
EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts
EfficientRollout applies self-speculative decoding with quantized drafter induction and system-aware acceptance policies to cut RL rollout latency up to 19.6% while preserving final model quality.
-
EinSort: Sorting is All We Need for Tensorizing LLM
Sorting tensor indices enables an adaptive tensorization method that discovers low-rank structure in LLM weights and KV caches, yielding better reconstruction quality than baselines.
-
A Task Decomposition and Planning Framework for Efficient LLM Inference in AI-Enabled WiFi-Offload Networks
An LLM planner for task decomposition and a decomposition-aware scheduler in multi-user WiFi networks reduce average latency by 20% and improve overall reward by 80% versus local-only and nearest-edge baselines.
-
Operator Fusion for LLM Inference on the Tensix Architecture
Fusing RMSNorm with matmul in attention and FFN on Tensix reduces attention latency up to 37.44% and MLP up to 15.89% with PCC above 98.75% on Qwen models.
-
On-Device Generative AI for GDPR-Compliant Visual Monitoring: Natural Language Alerts from Local Object Detection
A local pipeline on Raspberry Pi 5 with YOLOv5n-seg and Phi-3 Mini produces text alerts from on-device detection while keeping all image data private to meet GDPR requirements.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.