Title resolution pending

Frantar, E · 2024 · arXiv 2408.11743

6 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

stat.ML · 2026-05-13 · unverdicted · novelty 7.0

MSD eliminates dequantization from the GEMM path by decomposing BF16 activations into multiple low-precision parts that multiply directly with INT8 or MXFP4 weights, achieving near-16 effective bits for INT8 and 6.6 for MXFP4 with reduced HBM traffic.

Statistically-Lossless Quantization of Large Language Models

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

SLQ achieves task-lossless LLM quantization below 4 bits per parameter and distribution-lossless at 5-6 bits on average, with 1.7-3.6x speedups over FP16.

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

MCAP uses load-time Monte Carlo profiling to estimate layer importance, enabling dynamic quantization (W4A8 vs W4A16) and memory tiering (GPU/RAM/SSD) that delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_0 on NVIDIA T4 while fitting models into previously infeasible memory budgets.

Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.

The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning

cs.AI · 2026-02-14 · unverdicted · novelty 6.0

Reducing precision from 16-bit to 8/4-bit in multi-hop reasoning creates a quantization trap that raises net energy consumption and degrades accuracy, breaking linear scaling laws.

On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks

cs.LG · 2026-04-22 · unverdicted · novelty 4.0

Diffusion coding model CoDA shows smaller accuracy drops than Qwen3-1.7B under 2-4 bit quantization on HumanEval and MBPP.

citing papers explorer

Showing 6 of 6 citing papers.

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference
stat.ML · 2026-05-13 · unverdicted · none · ref 5

Statistically-Lossless Quantization of Large Language Models
cs.LG · 2026-05-04 · unverdicted · none · ref 4

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
cs.LG · 2026-04-22 · unverdicted · none · ref 8

Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
cs.LG · 2026-04-15 · unverdicted · none · ref 19

The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning
cs.AI · 2026-02-14 · unverdicted · none · ref 7

On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks
cs.LG · 2026-04-22 · unverdicted · none · ref 9
