Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models. arXiv preprint arXiv:2504.04823

Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, Lu Hou · 2025 · arXiv 2504.04823

12 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

cs.AI · 2026-06-01 · conditional · novelty 7.0

2-bit quantized reasoning models exhibit process failures like loops and delayed commitment that degrade end-to-end performance, but FP16 planning and loop rescue recover accuracy on MATH-500 from 17.2% to 74.2% for Qwen3-8B while retaining speed gains.

Quantized Reasoning Models Think They Need to Think Longer, but They Do Not

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Post-training quantization increases overthinking errors in reasoning models; a logit penalty on curated overthinking markers reduces CoT length 12-23% without accuracy loss.

Search Your Block Floating Point Scales!

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

LAQuant improves long-decoding accuracy on quantized reasoning models like Qwen3-4B by 15pp on AIME25 via layer-wise lookahead loss, achieving 3.42x speedup over FP16.

QuantClaw: Precision Where It Matters for OpenClaw

cs.AI · 2026-04-24 · unverdicted · novelty 6.0

QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.

FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

FluxMoE decouples MoE expert weights from persistent GPU residency via on-demand paging, achieving up to 3x throughput gains over vLLM in memory-constrained inference without accuracy loss.

Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction

cs.AI · 2025-09-15 · unverdicted · novelty 6.0

A pruning technique called Reasoning-Aware Compression (RAC) jointly reconstructs input and chain-of-thought activations to preserve reasoning performance better than standard methods when compressing models like DeepSeek-R1.

Cassandra: Enabling Reasoning LLMs at Edge via Self-Speculative Decoding

cs.AR · 2026-05-26 · unverdicted · novelty 5.0

Cassandra is a self-speculative decoding system that builds a draft model via fine-grained data selection and optimized pruning/mantissa truncation, achieving up to 2.41x speedup over BF16 and 1.81x more tokens than Eagle-3 on Llama 3 8B without training.

PubSwap: Public-Data Off-Policy Coordination for Federated RLVR

cs.LG · 2026-04-14 · unverdicted · novelty 5.0

PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.

ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling

cs.DC · 2026-01-29 · unverdicted · novelty 5.0

ZipMoE delivers up to 72.77% lower inference latency and 6.76x higher throughput for on-device MoE models via lossless compression and cache-affinity scheduling with a claimed provable guarantee.

The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes

cs.CL · 2026-06-09 · unverdicted · novelty 4.0

A literature survey that introduces a taxonomy for LLM reasoning paradigms, analyzes methodological trends, and synthesizes failure modes from over 300 papers.

citing papers explorer

Showing 12 of 12 citing papers.

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery
cs.AI · 2026-06-01 · conditional · none · ref 25

Quantized Reasoning Models Think They Need to Think Longer, but They Do Not
cs.LG · 2026-05-29 · unverdicted · none · ref 13

Search Your Block Floating Point Scales!
cs.LG · 2026-05-12 · unverdicted · none · ref 141

LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss
cs.LG · 2026-05-09 · unverdicted · none · ref 34

QuantClaw: Precision Where It Matters for OpenClaw
cs.AI · 2026-04-24 · unverdicted · none · ref 28

From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization
cs.CL · 2026-04-21 · unverdicted · none · ref 10

FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving
cs.LG · 2026-04-03 · unverdicted · none · ref 35

Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction
cs.AI · 2025-09-15 · unverdicted · none · ref 11

Cassandra: Enabling Reasoning LLMs at Edge via Self-Speculative Decoding
cs.AR · 2026-05-26 · unverdicted · none · ref 34

PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
cs.LG · 2026-04-14 · unverdicted · none · ref 11

ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling
cs.DC · 2026-01-29 · unverdicted · none · ref 9

The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes
cs.CL · 2026-06-09 · unverdicted · none · ref 154
