2-bit quantized reasoning models exhibit process failures like loops and delayed commitment that degrade end-to-end performance, but FP16 planning and loop rescue recover accuracy on MATH-500 from 17.2% to 74.2% for Qwen3-8B while retaining speed gains.
Quantization hurts reasoning? an empirical study on quantized reasoning models.arXiv preprint arXiv:2504.04823
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
Post-training quantization increases overthinking errors in reasoning models; a logit penalty on curated overthinking markers reduces CoT length 12-23% without accuracy loss.
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
LAQuant improves long-decoding accuracy on quantized reasoning models like Qwen3-4B by 15pp on AIME25 via layer-wise lookahead loss, achieving 3.42x speedup over FP16.
QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.
FluxMoE decouples MoE expert weights from persistent GPU residency via on-demand paging, achieving up to 3x throughput gains over vLLM in memory-constrained inference without accuracy loss.
A pruning technique called Reasoning-Aware Compression (RAC) jointly reconstructs input and chain-of-thought activations to preserve reasoning performance better than standard methods when compressing models like DeepSeek-R1.
Cassandra is a self-speculative decoding system that builds a draft model via fine-grained data selection and optimized pruning/mantissa truncation, achieving up to 2.41x speedup over BF16 and 1.81x more tokens than Eagle-3 on Llama 3 8B without training.
PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.
ZipMoE delivers up to 72.77% lower inference latency and 6.76x higher throughput for on-device MoE models via lossless compression and cache-affinity scheduling with a claimed provable guarantee.
A literature survey that introduces a taxonomy for LLM reasoning paradigms, analyzes methodological trends, and synthesizes failure modes from over 300 papers.
citing papers explorer
-
Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery
2-bit quantized reasoning models exhibit process failures like loops and delayed commitment that degrade end-to-end performance, but FP16 planning and loop rescue recover accuracy on MATH-500 from 17.2% to 74.2% for Qwen3-8B while retaining speed gains.
-
Quantized Reasoning Models Think They Need to Think Longer, but They Do Not
Post-training quantization increases overthinking errors in reasoning models; a logit penalty on curated overthinking markers reduces CoT length 12-23% without accuracy loss.
-
Search Your Block Floating Point Scales!
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
-
LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss
LAQuant improves long-decoding accuracy on quantized reasoning models like Qwen3-4B by 15pp on AIME25 via layer-wise lookahead loss, achieving 3.42x speedup over FP16.
-
QuantClaw: Precision Where It Matters for OpenClaw
QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
-
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization
LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.
-
FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving
FluxMoE decouples MoE expert weights from persistent GPU residency via on-demand paging, achieving up to 3x throughput gains over vLLM in memory-constrained inference without accuracy loss.
-
Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction
A pruning technique called Reasoning-Aware Compression (RAC) jointly reconstructs input and chain-of-thought activations to preserve reasoning performance better than standard methods when compressing models like DeepSeek-R1.
-
Cassandra: Enabling Reasoning LLMs at Edge via Self-Speculative Decoding
Cassandra is a self-speculative decoding system that builds a draft model via fine-grained data selection and optimized pruning/mantissa truncation, achieving up to 2.41x speedup over BF16 and 1.81x more tokens than Eagle-3 on Llama 3 8B without training.
-
PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.
-
ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling
ZipMoE delivers up to 72.77% lower inference latency and 6.76x higher throughput for on-device MoE models via lossless compression and cache-affinity scheduling with a claimed provable guarantee.
-
The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes
A literature survey that introduces a taxonomy for LLM reasoning paradigms, analyzes methodological trends, and synthesizes failure modes from over 300 papers.