hub

Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models

Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models , author= · 2024 · arXiv 2411.05007

25 Pith papers cite this work. Polarity classification is still indexing.

25 Pith papers citing it

read on arXiv browse 25 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation

cs.LG · 2026-05-31 · unverdicted · novelty 8.0

GPTQ-intrinsic LoRA augments GPTQ with intrinsic low-rank compensation via Hessian modification to achieve layer-wise reconstruction bounds that match information-theoretic lower bounds under structural assumptions.

OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

OrbitQuant is a data-agnostic PTQ technique for DiTs that uses RPBH rotation in a normalized basis to enable a single codebook across all inputs, achieving SOTA low-bit performance on FLUX.1, CogVideoX and similar models.

{\Omega}-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling

cs.CV · 2026-05-27 · unverdicted · novelty 7.0

Omega-QVLA is a post-training quantization framework achieving uniform W4A4 for VLA models' LLM backbone and DiT action head via composite SVD-Hadamard rotation and per-step scaling, matching FP16 success rates on LIBERO.

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.

HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

cs.LG · 2026-02-23 · unverdicted · novelty 7.0

QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memory savings on the quantized components.

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

cs.CL · 2025-12-01 · conditional · novelty 7.0

Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference

cs.LG · 2026-06-25 · unverdicted · novelty 6.0

SharQ combines input-adaptive N:M sparsity and FP4 quantization via sparse backbone plus dense residual, recovering 43-63% of the NVFP4-to-FP16 accuracy gap on Llama and Qwen models without calibration or retraining.

ReCache: Learning Budget-Aware Caching Schedules for Diffusion Models via REINFORCE

cs.CV · 2026-06-04 · unverdicted · novelty 6.0

ReCache learns recomputation schedules via policy gradients to maximize quality under a target compute budget for any caching mechanism in diffusion models.

LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

LASER introduces curvature-weighted SVD from second-order loss approximation and loss-aware rank allocation to compress VLMs, reporting over 2.3x decoding speedup under low-precision settings.

DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85x speedup over standard methods.

RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

RT-Lynx shifts DiT sparsity from weights to activations, reports up to 1.55x linear-layer speedup while preserving generation quality across multiple diffusion models.

Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

SplitQ improves low-bit PTQ for VLMs by isolating modality-specific outlier channels via MOCD and applying dual-branch adaptive calibration via ACC, outperforming prior methods on six datasets across W4A8 to W3A2 settings.

Search Your Block Floating Point Scales!

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization

cs.LG · 2026-04-30 · unverdicted · novelty 6.0

ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.

QuantClaw: Precision Where It Matters for OpenClaw

cs.AI · 2026-04-24 · unverdicted · novelty 6.0

QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.

FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

cs.LG · 2026-04-08 · unverdicted · novelty 6.0

Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

eOptShrinkQ compresses KV caches to ~2.2 bits per entry via optimal spectral shrinkage and quantization, outperforming prior methods on LongBench while matching FP16 on multi-needle retrieval.

CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model

cs.CV · 2026-05-16 · unverdicted · novelty 5.0

CAR-SAM introduces MatMul-Aware Compensation and Joint Cross-Attention Reconstruction to enable stable 4-bit post-training quantization of SAM, outperforming prior PTQ methods by 14.6% mAP on SAM-B and 6.6% on SAM-L.

WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models

cs.CV · 2026-04-02 · unverdicted · novelty 5.0

WSVD delivers over 1.8x faster VLM decoding via weighted low-rank approximation at fine granularity plus quantization, without accuracy loss.

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

cs.AR · 2025-09-11 · unverdicted · novelty 5.0

PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.

Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

cs.LG · 2026-06-10 · unverdicted · novelty 4.0

INT8 W8A8 post-training quantization of Ideogram 4.0 preserves FP8 quality on a 200-prompt benchmark while outperforming NF4 on CLIP score and offering a favorable quality-memory trade-off via GGUF Q4_K.

Efficient Matrix Implementation for Rotary Position Embedding

cs.LG · 2026-04-10 · unverdicted · novelty 4.0

RoME reformulates RoPE as matrix operations to eliminate dimension-specific vector overhead and enable fused execution on modern hardware while remaining mathematically equivalent.

citing papers explorer

Showing 2 of 2 citing papers after filters.

QuantClaw: Precision Where It Matters for OpenClaw cs.AI · 2026-04-24 · unverdicted · none · ref 24
QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond cs.AI · 2026-04-24 · unreviewed · ref 221

Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer