Title resolution pending

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L · 2024 · arXiv 2404.00456

17 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

cs.PF · 2026-05-07 · unverdicted · novelty 7.0

A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

cs.LG · 2026-02-23 · unverdicted · novelty 7.0

QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memory savings on the quantized components.

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

cs.CL · 2025-12-01 · conditional · novelty 7.0

Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization

cs.LG · 2026-04-30 · unverdicted · novelty 6.0

ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.

MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs

cs.AR · 2026-04-17 · unverdicted · novelty 6.0

MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.

Rethinking Residual Errors in Compensation-based LLM Quantization

cs.LG · 2026-04-09 · conditional · novelty 6.0

Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.

Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism

cs.CL · 2026-01-09 · unverdicted · novelty 6.0

Double achieves up to 5.3x inference speedup on 70B LLMs via synchronous double retrieval speculative parallelism that is lossless and outperforms trained baselines like EAGLE-3.

You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations

cs.CL · 2025-11-09 · conditional · novelty 6.0

TAQ estimates per-layer importance from hidden representations and output sensitivity on task calibration data to allocate mixed precision in a training-free PTQ setting, outperforming task-agnostic baselines on accuracy-memory ratio across benchmarks.

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

cs.LG · 2025-04-28 · unverdicted · novelty 6.0

TurboQuant achieves near-optimal vector quantization distortion for both MSE and inner products via random rotation and per-coordinate scalar quantization, with a formal proof that it matches lower bounds within a factor of approximately 2.7.

31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding

cs.AR · 2026-05-10 · unverdicted · novelty 5.0

A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

cs.LG · 2026-05-05 · unverdicted · novelty 5.0 · 2 refs

HeadQ applies score-space logit corrections for keys and attention-weighted surrogates for values to KV-cache quantization, removing 84-94% of excess perplexity in 2-bit key experiments across six models.

StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

cs.LG · 2026-05-04 · accept · novelty 5.0

Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

High-Rate Quantized Matrix Multiplication I

cs.IT · 2026-01-23 · unverdicted · novelty 5.0

High-rate quantization theory yields accurate approximations for the distortion of absmax INT and FP schemes in generic weight-plus-activation matrix multiplication.

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

cs.AR · 2025-09-11 · unverdicted · novelty 5.0

PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.

MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

cs.LG · 2024-12-19 · unverdicted · novelty 5.0

MixLLM uses global output-feature importance to set mixed bit-widths for LLM quantization and adds two-step dequantization plus software pipelining for system efficiency.

DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization

cs.CV · 2026-04-20 · unverdicted · novelty 4.0

DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3 with lower cost.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

citing papers explorer

Showing 17 of 17 citing papers.

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
cs.PF · 2026-05-07 · unverdicted · none · ref 7

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
cs.LG · 2026-02-23 · unverdicted · none · ref 1

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
cs.CL · 2025-12-01 · conditional · none · ref 37

Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
cs.LG · 2026-04-30 · unverdicted · none · ref 8

MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
cs.AR · 2026-04-17 · unverdicted · none · ref 4

Rethinking Residual Errors in Compensation-based LLM Quantization
cs.LG · 2026-04-09 · conditional · none · ref 1

Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism
cs.CL · 2026-01-09 · unverdicted · none · ref 2

You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations
cs.CL · 2025-11-09 · conditional · none · ref 5

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
cs.LG · 2025-04-28 · unverdicted · none · ref 8

31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding
cs.AR · 2026-05-10 · unverdicted · none · ref 11

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
cs.LG · 2026-05-05 · unverdicted · none · ref 13 · 2 links

StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
cs.LG · 2026-05-04 · accept · none · ref 26

High-Rate Quantized Matrix Multiplication I
cs.IT · 2026-01-23 · unverdicted · none · ref 17

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference
cs.AR · 2025-09-11 · unverdicted · none · ref 6

MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
cs.LG · 2024-12-19 · unverdicted · none · ref 3

DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization
cs.CV · 2026-04-20 · unverdicted · none · ref 1

A Survey on Efficient Inference for Large Language Models
cs.CL · 2024-04-22 · accept · none · ref 217
