ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.
hub Canonical reference
InProceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23)
Canonical reference. 82% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constrained SkyPilot baseline.
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
TokenStack's heterogeneous HBM-PIM design with base-die control and topology-aware KV placement delivers 1.62x higher geometric-mean token throughput and 1.70x SLO-compliant serving capacity than AttAcc while cutting per-token energy by 30-47%.
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
CuLifter recovers types from untyped GPU register files via constraint propagation to lift 99.98% of 24,437 functions across 919 cubins to valid LLVM IR.
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kernel and 1.30x end-to-end speedups with 0.93% mean regret after brief profiling.
RetroInfer introduces the wave index and wave buffer to realize sparse KV-cache attention for long-context LLM inference with up to 4.4X throughput gains while matching full-attention accuracy.
SkipDisk is a disk-memory hybrid ANN search that achieves 63-85% of HNSW latency at 10-20% memory footprint via dedicated pivots for tighter lower bounds, three-level pruning, and decoupled async I/O.
EdgeServing schedules multi-DNN inference on edge GPUs via time-division sharing and early exits, using a stability score to minimize system-wide SLO violations and P95 latency.
Strait cuts high-priority deadline violations in ML inference serving by 1-11 percentage points through contention modeling and priority scheduling under high GPU load.
citing papers explorer
-
When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models
ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.
-
QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding
Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.
-
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
-
Incisor: Ex Ante Cloud Instance Selection for HPC Jobs
Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constrained SkyPilot baseline.
-
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
-
TokenStack: A Heterogeneous HBM-PIM Architecture and Runtime for Efficient LLM Inference
TokenStack's heterogeneous HBM-PIM design with base-die control and topology-aware KV placement delivers 1.62x higher geometric-mean token throughput and 1.70x SLO-compliant serving capacity than AttAcc while cutting per-token energy by 30-47%.
-
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
-
CuLifter: Lifting GPU Binaries to Typed IR
CuLifter recovers types from untyped GPU register files via constraint propagation to lift 99.98% of 24,437 functions across 919 cubins to valid LLVM IR.
-
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
-
RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts
RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kernel and 1.30x end-to-end speedups with 0.93% mean regret after brief profiling.
-
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
RetroInfer introduces the wave index and wave buffer to realize sparse KV-cache attention for long-context LLM inference with up to 4.4X throughput gains while matching full-attention accuracy.
-
Low-Latency Out-of-Core ANN Search in High-Dimensional Space
SkipDisk is a disk-memory hybrid ANN search that achieves 63-85% of HNSW latency at 10-20% memory footprint via dedicated pivots for tighter lower bounds, three-level pruning, and decoupled async I/O.
-
EdgeServing: Deadline-Aware Multi-DNN Serving at the Edge
EdgeServing schedules multi-DNN inference on edge GPUs via time-division sharing and early exits, using a stability score to minimize system-wide SLO violations and P95 latency.
-
Strait: Perceiving Priority and Interference in ML Inference Serving
Strait cuts high-priority deadline violations in ML inference serving by 1-11 percentage points through contention modeling and priority scheduling under high GPU load.