Quantization meets reasoning: Exploring llm low-bit quantization degradation for mathematical reasoning

Zhen Li, Yupeng Su, Runming Yang, Congkai Xie, Zheng Wang, Zhongwei Xie, Ngai Wong, Hongxia Yang · 2025 · arXiv 2501.03035

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

LAQuant improves long-decoding accuracy on quantized reasoning models like Qwen3-4B by 15pp on AIME25 via layer-wise lookahead loss, achieving 3.42x speedup over FP16.

Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.

FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

FluxMoE decouples MoE expert weights from persistent GPU residency via on-demand paging, achieving up to 3x throughput gains over vLLM in memory-constrained inference without accuracy loss.

The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning

cs.AI · 2026-02-14 · unverdicted · novelty 6.0

Reducing precision from 16-bit to 8/4-bit in multi-hop reasoning creates a quantization trap that raises net energy consumption and degrades accuracy, breaking linear scaling laws.

Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

cs.CL · 2026-05-19 · unverdicted · novelty 5.0

Mix-Quant quantizes prefilling to NVFP4 and keeps BF16 for decoding in agentic LLMs, achieving up to 3x prefilling speedup while largely preserving task performance on long-context and agentic benchmarks.

Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector

cs.CL · 2025-09-08 · unverdicted · novelty 3.0

Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.

citing papers explorer

Showing 6 of 6 citing papers.

LAQuant: A Simple Overhead-free Large Reasoning Model Quantization by Layer-wise Lookahead Loss cs.LG · 2026-05-09 · unverdicted · none · ref 29
LAQuant improves long-decoding accuracy on quantized reasoning models like Qwen3-4B by 15pp on AIME25 via layer-wise lookahead loss, achieving 3.42x speedup over FP16.
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression cs.LG · 2026-05-09 · unverdicted · none · ref 61
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving cs.LG · 2026-04-03 · unverdicted · none · ref 33
FluxMoE decouples MoE expert weights from persistent GPU residency via on-demand paging, achieving up to 3x throughput gains over vLLM in memory-constrained inference without accuracy loss.
The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning cs.AI · 2026-02-14 · unverdicted · none · ref 18
Reducing precision from 16-bit to 8/4-bit in multi-hop reasoning creates a quantization trap that raises net energy consumption and degrades accuracy, breaking linear scaling laws.
Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs cs.CL · 2026-05-19 · unverdicted · none · ref 17
Mix-Quant quantizes prefilling to NVFP4 and keeps BF16 for decoding in agentic LLMs, achieving up to 3x prefilling speedup while largely preserving task performance on long-context and agentic benchmarks.
Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector cs.CL · 2025-09-08 · unverdicted · none · ref 32
Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.

Quantization meets reasoning: Exploring llm low-bit quantization degradation for mathematical reasoning

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer