Omniquant: Omnidirectionally calibrated quantization for large language models
11 Pith papers cite this work. Polarity classification is still indexing.
2026: 11 representative citing papers
-
SDLLM is a spike-driven LLM that uses gamma-SQP two-step encoding, bidirectional symmetric quantization, and membrane potential clipping to achieve 7x lower energy consumption and 4.2% higher accuracy than prior spike-based language models.
-
LOCALUT delivers a 1.82x geometric-mean speedup for quantized DNN inference on real UPMEM DRAM-PIM devices by using operation-packed LUTs with canonicalization, reordering, and slice streaming.
-
DA-PTQ quantizes VLAs by compensating cross-space distortions and allocating mixed precision to minimize motion errors and kinematic drift in trajectories.
-
ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.
-
Search Your Block Floating Point Scales!
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ accuracy by up to 15 points on MATH500 for Qwen3-8B and attention perplexity by 0.77 on Llama 3.1 70B (a generic per-block scale-search sketch appears after this list).
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead (a rough null-space sketch appears after this list).
-
BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment
BitRL enables on-device RL agents via 1-bit quantized language models, delivering 10-16x memory reduction and 3-5x energy efficiency gains with 85-98% retained performance.
-
Rethinking Residual Errors in Compensation-based LLM Quantization
Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs (a simplified GPTQ-style compensation loop is sketched after this list).
-
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
-
BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
BWTA achieves near full-precision accuracy on BERT and LLMs using binary weights and ternary activations, with 16-24x kernel speedups via specialized CUDA kernels (a generic binarization/ternarization sketch appears after this list).
-
Efficient VQ-QAT and Mixed Vector/Linear quantized Neural Networks
Three incremental VQ techniques using cosine similarity and NAS for neural-network weight compression yield design insights but no consistent accuracy gains over existing approaches (a cosine-similarity codeword-assignment sketch appears after this list).
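
For the ScaleSearch entry: a minimal, generic sketch of per-block scale search for FP4-style block floating point quantization, assuming the standard E2M1 value grid, a 16-value block (NVFP4 groups 16 values per scale), and a simple MSE criterion. The candidate count and shrink range are illustrative choices, not the paper's algorithm.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # FP4 (E2M1) magnitudes

def quantize_block(block, scale):
    """Snap each value to the nearest representable FP4 magnitude times `scale`."""
    mags = np.abs(block) / max(scale, 1e-12)
    idx = np.abs(mags[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(block) * E2M1_GRID[idx] * scale

def search_block_scale(block, n_candidates=16):
    """Try shrunken versions of the absmax scale and keep the lowest-MSE one."""
    base = np.abs(block).max() / E2M1_GRID[-1]           # the usual absmax scale
    best_scale, best_err = base, np.inf
    for shrink in np.linspace(0.5, 1.0, n_candidates):   # candidate scales
        s = base * shrink
        err = np.mean((block - quantize_block(block, s)) ** 2)
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale

rng = np.random.default_rng(0)
block = rng.standard_normal(16)                           # one 16-value block
s = search_block_scale(block)
print("absmax MSE:", np.mean((block - quantize_block(block, np.abs(block).max() / 6)) ** 2))
print("search MSE:", np.mean((block - quantize_block(block, s)) ** 2))
```

Because the candidate set includes the plain absmax scale, the searched scale never does worse than the default on the MSE objective.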
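
For the OSAQ entry: a rough sketch of the underlying idea that directions the calibration data never excites (the near-null space of H = X Xᵀ) can absorb weight changes without moving the layer's outputs. The clipping threshold and simple projection below are illustrative assumptions, not OSAQ's closed-form transformation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))           # calibration activations: (in_dim, n_samples)
W = rng.standard_normal((128, 256))          # weights: (out_dim, in_dim)
W[0, 0] = 25.0                               # inject an artificial outlier

H = X @ X.T                                  # (in_dim, in_dim) Hessian proxy from calibration
eigval, eigvec = np.linalg.eigh(H)
null_basis = eigvec[:, eigval < 1e-6 * eigval.max()]    # near-null directions of H

clip = 4.0 * W.std()                                     # illustrative outlier threshold
excess = np.sign(W) * np.maximum(np.abs(W) - clip, 0)    # part of W above the clip
delta = -(excess @ null_basis) @ null_basis.T            # projection onto the null space

W_new = W + delta                            # outliers shrink, outputs barely move
print("max |W| before/after:", np.abs(W).max(), np.abs(W_new).max())
print("relative output drift:", np.linalg.norm(W_new @ X - W @ X) / np.linalg.norm(W @ X))
```

With fewer calibration samples than input dimensions, H is rank-deficient, so the null space is large and the additive correction changes the calibrated outputs only at floating-point noise level.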
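
For the Rethinking Residual Errors entry: background on the compensation loop that GPTQ-style methods share, namely quantizing one column at a time and folding the weighted residual into the not-yet-quantized columns via the Cholesky factor of the inverse Hessian. Round-to-nearest on a symmetric integer grid stands in for the real quantizer; this simplified loop is a sketch of the baseline, not the paper's redefined residual.

```python
import numpy as np

def rtn(w, n_bits=4):
    """Per-column symmetric round-to-nearest quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.maximum(np.abs(w).max(axis=0), 1e-12) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def gptq_like(W, X, n_bits=4, damp=0.01):
    """W: (out, in), X: (in, n_samples). Returns a quantized copy of W."""
    W = W.copy()
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])   # damping for stability
    U = np.linalg.cholesky(np.linalg.inv(H)).T              # upper Cholesky factor of H^-1
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        q = rtn(W[:, j:j + 1], n_bits)[:, 0]
        Q[:, j] = q
        err = (W[:, j] - q) / U[j, j]                       # weighted residual of column j
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])         # compensate remaining columns
    return Q

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
X = rng.standard_normal((32, 128))
for name, Wq in [("RTN", rtn(W)), ("GPTQ-like", gptq_like(W, X))]:
    print(name, "output MSE:", np.mean((Wq @ X - W @ X) ** 2))
```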
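
For the BWTA entry: a generic illustration of binary-weight / ternary-activation arithmetic. The per-row scale and the 0.7 threshold ratio are common XNOR-Net / TWN-style choices assumed for the sketch, not necessarily BWTA's.

```python
import numpy as np

def binarize_weights(W):
    """Per-output-row scale alpha = mean|W|; weights become alpha * sign(W)."""
    alpha = np.abs(W).mean(axis=1, keepdims=True)
    return alpha * np.sign(W)

def ternarize_acts(A, thr_ratio=0.7):
    """Zero out small activations, keep the sign elsewhere, rescale by the mean magnitude."""
    thr = thr_ratio * np.abs(A).mean()
    mask = np.abs(A) > thr
    beta = np.abs(A[mask]).mean() if mask.any() else 0.0
    return beta * np.sign(A) * mask

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))      # (out_features, in_features)
A = rng.standard_normal((64, 8))       # (in_features, batch)

full = W @ A
quant = binarize_weights(W) @ ternarize_acts(A)
print("relative output error:", np.linalg.norm(full - quant) / np.linalg.norm(full))
```

Since both operands take only a handful of values, the matrix product reduces to sign flips, masking, and popcount-style accumulation, which is what the specialized kernels exploit.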
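
For the VQ-QAT entry: a minimal sketch of vector-quantizing weight sub-vectors by cosine-similarity codeword assignment with a per-vector least-squares scale. The codebook size and sub-vector length are arbitrary choices here, not the paper's configuration.

```python
import numpy as np

def vq_compress(W, codebook, sub_dim=8):
    """Split W into sub_dim-length vectors and replace each by its best-matching codeword."""
    vecs = W.reshape(-1, sub_dim)
    v_n = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12)
    c_n = codebook / (np.linalg.norm(codebook, axis=1, keepdims=True) + 1e-12)
    idx = (v_n @ c_n.T).argmax(axis=1)                    # nearest codeword by cosine similarity
    chosen = codebook[idx]
    scale = (vecs * chosen).sum(1) / ((chosen ** 2).sum(1) + 1e-12)  # least-squares magnitude
    return (scale[:, None] * chosen).reshape(W.shape), idx

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))
codebook = rng.standard_normal((256, 8))                  # 256 codewords of length 8
W_hat, idx = vq_compress(W, codebook)
print("reconstruction MSE:", np.mean((W - W_hat) ** 2))
```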