20 papers cite this work.
Representative citing papers (2026)
- KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
  KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
- Breaking the Autoregressive Chain: Hyper-Parallel Decoding for Efficient LLM-Based Attribute Value Extraction
  Hyper-Parallel Decoding enables parallel generation of independent sequences in LLMs via position ID manipulation, delivering up to 13.8x speedup for attribute value extraction (position-ID layout sketched after this list).
- Choose, Don't Label: Multiple-Choice Query Synthesis for Program Disambiguation
  Multiple-choice queries synthesized from Hoare triples enable more reliable identification of intended programs than labeled-example supervision in active learning for program disambiguation.
- TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
  TriAttention compresses the KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.
- ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
  ENEC delivers 3.43x higher throughput than DietGPU and 1.12x better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3x end-to-end inference speedup.
- RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation
  RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition, while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.
- OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
  OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation drawn from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead (null-space idea sketched after this list).
- Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
  JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces with global rank allocation by marginal gain; at 80% retained parameters it outperforms PEFT methods such as DoRA applied at 100% parameters, averaging 89.2% on ViT-Base and 80.9% on Llama2-7B (greedy allocation sketched after this list).
- Joint Architecture-Token-Bitwidth Multi-Axis Optimization of Vision Transformers for Semiconductor IC Packaging
  A joint architecture-token-bitwidth optimization of Vision Transformers delivers over 10x improvements in throughput, parameter count, FLOPs, and energy on a semiconductor defect classification task while preserving the required accuracy.
- DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
  DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full-attention performance on LongBench (hash-based retrieval sketched after this list).
- LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
  LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling.
- Strix: Re-thinking NPU Reliability from a System Perspective
  Strix delivers sub-microsecond fault detection, localization, and correction on NPUs with 1.04x slowdown and minimal hardware cost via system-level re-partitioning and targeted safeguards.
- MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
  MP-ISMoE uses Gaussian-noise-perturbed iterative quantization and an interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.
- Paper Espresso: From Paper Overload to Research Insight
  Paper Espresso deploys LLMs to summarize and analyze trends across 13,300+ arXiv papers over 35 months, releasing metadata that shows non-saturating topic growth and higher engagement for novel topics.
- Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference
  LLMLingua prompt compression yields up to 18% end-to-end LLM speedups with unchanged quality when prompt length, compression ratio, and hardware align, plus an open profiler to predict the break-even point (latency model sketched after this list).
- Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs
  A data-driven adaptive policy that selects KV-cache bit-widths from token importance features reduces decoding latency by ~18% and improves accuracy over static quantization while staying near FP16 levels on SmolLM models (per-token policy sketched after this list).
- Transparent Screening for LLM Inference and Training Impacts
  The paper proposes a transparent proxy framework for estimating the environmental impacts of LLM inference and training from natural-language application descriptions.
- Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
  Gemma-4-E4B with few-shot chain-of-thought reaches the highest weighted accuracy of 0.675 at 14.9 GB VRAM, while the larger Gemma-4-26B-A4B MoE model scores 0.663 but uses 48.1 GB.
- Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
  A survey that taxonomizes efficiency methods for LVLMs across the full inference pipeline, decouples the problem into information density, long-context attention, and memory limits, and outlines four future research frontiers with pilot insights.
- Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda
  This research agenda argues that cloud-native architectures, microservices, autoscaling, and emerging trends like serverless inference and federated learning are required to make large language models efficient and scalable.
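
To make a few of the mechanisms named above concrete, here are short illustrative sketches. First, the position-ID manipulation behind Hyper-Parallel Decoding: based only on the one-line summary, independent attribute queries are packed after a shared prefix, and each branch reuses the same offset run of position IDs so it decodes as if it directly followed the prefix. The function name, the +7 spacing, and the exact layout are our assumptions, not the paper's API.

```python
# Sketch (not the paper's API): pack independent attribute queries after a
# shared product prefix and hand-build position IDs so every branch decodes
# as if it directly followed the prefix. A block-diagonal attention mask
# (not shown) would keep branches from attending to each other.

def build_position_ids(prefix_len, branch_lens, spacing=7):
    """Position IDs for a packed sequence: [shared prefix | branch 0 | branch 1 | ...]."""
    pos = list(range(prefix_len))             # shared prefix: 0 .. prefix_len-1
    for branch_len in branch_lens:
        start = prefix_len + spacing          # every branch restarts after the same gap
        pos.extend(range(start, start + branch_len))
    return pos

# Shared 4-token prefix, two 3-token branches, inserted spacing of +7:
print(build_position_ids(4, [3, 3]))
# -> [0, 1, 2, 3, 11, 12, 13, 11, 12, 13]
```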
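Next, a toy numpy sketch of the null-space idea named in the OSAQ summary, under our own simplifications: the "Hessian" is a rank-deficient X X^T proxy, and we shrink the weight vector's L2 norm inside its null space, a closed-form additive update that leaves the layer's outputs on the calibration inputs unchanged. OSAQ's actual objective and transformation are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 16                         # fewer calibration samples than dims
X = rng.standard_normal((d, n))       # assumed calibration inputs
H = X @ X.T                           # Hessian proxy, rank <= n < d
w = rng.standard_normal(d)
w[3] = 12.0                           # an outlier weight that hurts low-bit grids

# Orthonormal basis N for H's (near-)null space: directions the calibration
# inputs never excite, so adding N @ t leaves X.T @ w unchanged.
eigval, eigvec = np.linalg.eigh(H)
N = eigvec[:, eigval < 1e-8 * eigval.max()]

# Closed-form additive update: minimize ||w + N @ t||_2 over t. With
# orthonormal columns this is t = -N.T @ w, shrinking the outlier while
# preserving the layer's behavior on the calibration data.
w_abs = w + N @ (-N.T @ w)

print(f"max|w| before: {np.abs(w).max():.2f}, after: {np.abs(w_abs).max():.2f}")
print(f"output drift ||X.T @ (w' - w)||: {np.linalg.norm(X.T @ (w_abs - w)):.2e}")
```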
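The "global rank allocation by marginal gain" in the JACTUS summary suggests a greedy budget allocator. Here is a minimal sketch under the assumption that a layer's marginal gain for one more rank is its next squared singular value; the paper's gain is task-aware, which this ignores, and all names and numbers are illustrative.

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical descending singular values for four layers; keeping one more
# rank-1 component in layer l is assumed to be worth sigma_l[r]**2.
sigmas = [np.sort(rng.random(64))[::-1] for _ in range(4)]

def allocate_ranks(sigmas, budget):
    """Greedy global allocation: spend each unit of rank budget wherever the
    marginal gain (the next squared singular value) is currently largest."""
    ranks = [0] * len(sigmas)
    heap = [(-s[0] ** 2, layer) for layer, s in enumerate(sigmas)]
    heapq.heapify(heap)                       # max-heap via negated gains
    for _ in range(budget):
        _, layer = heapq.heappop(heap)
        ranks[layer] += 1
        if ranks[layer] < len(sigmas[layer]):
            heapq.heappush(heap, (-sigmas[layer][ranks[layer]] ** 2, layer))
    return ranks

print(allocate_ranks(sigmas, budget=100))     # per-layer ranks, summing to 100
```

Because each layer's gains are non-increasing, this greedy pass maximizes retained spectral energy for the budget.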
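For DASH-KV, here is a generic sign-LSH sketch of hash-based KV retrieval: keys are bucketed by random-hyperplane signatures, and at decode time only keys whose bucket matches the query's signature are scored at full precision, with the rest eligible for low-bit storage. The paper's asymmetric hashing scheme is not reproduced here; this shows only the flavor.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_keys, n_bits = 64, 1024, 8
planes = rng.standard_normal((n_bits, d))     # shared random hyperplanes

def signature(x):
    """Sign-LSH signature: one bit per hyperplane, packed into a byte."""
    return int(np.packbits((planes @ x > 0).astype(np.uint8))[0])

keys = rng.standard_normal((n_keys, d))
buckets = {}
for i, k in enumerate(keys):
    buckets.setdefault(signature(k), []).append(i)

query = rng.standard_normal(d)
hot = buckets.get(signature(query), [])       # keys likely to score high
# Attend over `hot` at full precision; the remaining entries could be kept
# in a low-bit format and consulted only as a fallback.
print(f"retrieved {len(hot)} of {n_keys} cached keys for this query")
```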
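The break-even point measured in the prompt-compression paper can be captured by a back-of-the-envelope latency model. Every constant below (per-token prefill, decode, and compression costs, the fixed compressor overhead, the 0.5 ratio) is an assumption for illustration; the point is only that speedup grows with prompt length and can dip below 1x for short prompts.

```python
def latency(prompt_toks, out_toks, ratio=1.0, compress=False,
            prefill_s=0.25e-3,    # s per prompt token (assumed)
            decode_s=20e-3,       # s per output token (assumed)
            comp_fixed_s=0.3,     # fixed compressor overhead (assumed)
            comp_s=0.05e-3):      # s per original prompt token (assumed)
    """Additive latency model: optional compression, prefill over the
    (possibly shortened) prompt, then decoding."""
    t = comp_fixed_s + comp_s * prompt_toks if compress else 0.0
    t += prefill_s * prompt_toks * (ratio if compress else 1.0)
    return t + decode_s * out_toks

for n in (1_000, 5_000, 20_000):
    speedup = latency(n, 200) / latency(n, 200, ratio=0.5, compress=True)
    print(f"{n:>6}-token prompt: {speedup:.2f}x end-to-end speedup")
```

With these constants the break-even falls between 1k and 5k prompt tokens; a profiler like the paper's would fit the constants per model and hardware instead of assuming them.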
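Finally, a minimal sketch of per-token KV-cache bit-width selection as described in the "Don't Waste Bits!" summary. The importance scores, the thresholds, and the uniform fake-quantizer are all stand-ins; the paper learns its policy from data rather than using fixed cutoffs.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits):
    """Uniform symmetric fake-quantization of a vector to `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(x).max()) / levels, 1e-8)
    return np.clip(np.round(x / scale), -levels, levels) * scale

def pick_bits(score):
    """Illustrative fixed thresholds; the paper learns its policy from data."""
    return 8 if score > 0.9 else 4 if score > 0.5 else 2

keys = rng.standard_normal((128, 64))        # cached key vectors, one per token
importance = rng.random(128)                 # stand-in for attention-mass features

kv_cache = np.stack([quantize(k, pick_bits(s)) for k, s in zip(keys, importance)])
print(f"mean abs quantization error: {np.abs(kv_cache - keys).mean():.4f}")
```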