hub Canonical reference

InProceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23)

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, Ion Stoica · 2023

Canonical reference. 82% of citing Pith papers cite this work as background.

14 Pith papers citing it

Background 82% of classified citations

browse 14 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 9 method 2

citation-polarity summary

background 9 use method 2

representative citing papers

When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

cs.IR · 2026-04-29 · unverdicted · novelty 7.0

ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.

QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

quant-ph · 2026-04-28 · unverdicted · novelty 7.0

Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

cs.CL · 2026-04-28 · accept · novelty 7.0

SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

Incisor: Ex Ante Cloud Instance Selection for HPC Jobs

cs.DC · 2026-04-27 · unverdicted · novelty 7.0

Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constrained SkyPilot baseline.

MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems

cs.AR · 2026-05-07 · unverdicted · novelty 6.0

MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.

TokenStack: A Heterogeneous HBM-PIM Architecture and Runtime for Efficient LLM Inference

cs.AR · 2026-05-07 · unverdicted · novelty 6.0

TokenStack's heterogeneous HBM-PIM design with base-die control and topology-aware KV placement delivers 1.62x higher geometric-mean token throughput and 1.70x SLO-compliant serving capacity than AttAcc while cutting per-token energy by 30-47%.

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

cs.CR · 2026-04-30 · unverdicted · novelty 6.0

TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

CuLifter: Lifting GPU Binaries to Typed IR

cs.AR · 2026-04-30 · unverdicted · novelty 6.0

CuLifter recovers types from untyped GPU register files via constraint propagation to lift 99.98% of 24,437 functions across 919 cubins to valid LLVM IR.

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

cs.LG · 2026-04-28 · unverdicted · novelty 6.0

RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kernel and 1.30x end-to-end speedups with 0.93% mean regret after brief profiling.

RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

cs.LG · 2025-05-05 · conditional · novelty 6.0

RetroInfer introduces the wave index and wave buffer to realize sparse KV-cache attention for long-context LLM inference with up to 4.4X throughput gains while matching full-attention accuracy.

Low-Latency Out-of-Core ANN Search in High-Dimensional Space

cs.DB · 2026-05-07 · unverdicted · novelty 5.0

SkipDisk is a disk-memory hybrid ANN search that achieves 63-85% of HNSW latency at 10-20% memory footprint via dedicated pivots for tighter lower bounds, three-level pruning, and decoupled async I/O.

EdgeServing: Deadline-Aware Multi-DNN Serving at the Edge

cs.DC · 2026-05-07 · unverdicted · novelty 5.0

EdgeServing schedules multi-DNN inference on edge GPUs via time-division sharing and early exits, using a stability score to minimize system-wide SLO violations and P95 latency.

Strait: Perceiving Priority and Interference in ML Inference Serving

cs.LG · 2026-04-30 · unverdicted · novelty 5.0 · 2 refs

Strait cuts high-priority deadline violations in ML inference serving by 1-11 percentage points through contention modeling and priority scheduling under high GPU load.

citing papers explorer

Showing 14 of 14 citing papers.

When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models cs.IR · 2026-04-29 · unverdicted · none · ref 18
ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.
QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding quant-ph · 2026-04-28 · unverdicted · none · ref 59
Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models cs.CL · 2026-04-28 · accept · none · ref 37
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
Incisor: Ex Ante Cloud Instance Selection for HPC Jobs cs.DC · 2026-04-27 · unverdicted · none · ref 51
Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constrained SkyPilot baseline.
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems cs.AR · 2026-05-07 · unverdicted · none · ref 32
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
TokenStack: A Heterogeneous HBM-PIM Architecture and Runtime for Efficient LLM Inference cs.AR · 2026-05-07 · unverdicted · none · ref 24
TokenStack's heterogeneous HBM-PIM design with base-die control and topology-aware KV placement delivers 1.62x higher geometric-mean token throughput and 1.70x SLO-compliant serving capacity than AttAcc while cutting per-token energy by 30-47%.
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning cs.CR · 2026-04-30 · unverdicted · none · ref 15
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
CuLifter: Lifting GPU Binaries to Typed IR cs.AR · 2026-04-30 · unverdicted · none · ref 36
CuLifter recovers types from untyped GPU register files via constraint propagation to lift 99.98% of 24,437 functions across 919 cubins to valid LLVM IR.
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving cs.LG · 2026-04-29 · unverdicted · none · ref 29
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts cs.LG · 2026-04-28 · unverdicted · none · ref 4
RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kernel and 1.30x end-to-end speedups with 0.93% mean regret after brief profiling.
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference cs.LG · 2025-05-05 · conditional · none · ref 44
RetroInfer introduces the wave index and wave buffer to realize sparse KV-cache attention for long-context LLM inference with up to 4.4X throughput gains while matching full-attention accuracy.
Low-Latency Out-of-Core ANN Search in High-Dimensional Space cs.DB · 2026-05-07 · unverdicted · none · ref 65
SkipDisk is a disk-memory hybrid ANN search that achieves 63-85% of HNSW latency at 10-20% memory footprint via dedicated pivots for tighter lower bounds, three-level pruning, and decoupled async I/O.
EdgeServing: Deadline-Aware Multi-DNN Serving at the Edge cs.DC · 2026-05-07 · unverdicted · none · ref 4
EdgeServing schedules multi-DNN inference on edge GPUs via time-division sharing and early exits, using a stability score to minimize system-wide SLO violations and P95 latency.
Strait: Perceiving Priority and Interference in ML Inference Serving cs.LG · 2026-04-30 · unverdicted · none · ref 61 · 2 links
Strait cuts high-priority deadline violations in ML inference serving by 1-11 percentage points through contention modeling and priority scheduling under high GPU load.

InProceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23)

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer