Helix: Serving large language models over heterogeneous GPUs and network via max-flow
10 Pith papers cite this work.
citing papers explorer
- Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation
  Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
- Partitioning Unstructured Sparse Tensor Algebra for Load-Balanced Parallel Execution
  A new partitioning algorithm that provably load-balances arbitrary sparse tensor algebra expressions by generalizing parallel merging to multi-operand, multi-dimensional hierarchical structures, implemented in a compiler framework.
- Feedback-Driven Execution for LLM-Based Binary Analysis
  FORGE uses a reasoning-action-observation loop and Dynamic Forest of Agents to perform scalable LLM-based binary analysis, finding 1,274 vulnerabilities across 591 of 3,457 real-world firmware binaries at 72.3% precision and broader coverage than prior methods.
- HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments
  HetRL delivers up to 9.17x higher throughput for LLM RL training on heterogeneous GPUs by using hybrid and ILP-based schedulers to solve a joint optimization problem over computation and data dependencies.
- PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
  PALS adds dynamic GPU power capping to LLM serving frameworks like vLLM, jointly tuning it with batch size via offline models and feedback control to improve energy efficiency up to 26.3% and cut QoS violations 4-7x on dense and MoE models.
- ShuntServe: Cost-Efficient LLM Serving on Heterogeneous Spot GPU Clusters
  ShuntServe reports 1.42x and 1.35x higher throughput than baselines plus 31.9 percent and 31.2 percent cost-efficiency gains over on-demand instances for Llama-3.1-70B and Qwen3-32B on heterogeneous AWS spot clusters.
- NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference
  NetKV is a network-aware O(|D|) greedy scheduler for decode instance selection that reduces mean TTFT by up to 21.2% versus round-robin and 17.6% versus cache+load baselines in 64-GPU fat-tree simulations.
- Distributed Generative Inference of LLM at Internet Scales with Multi-Dimensional Communication Optimization
  BloomBee is a distributed LLM inference system that achieves up to 1.76x higher throughput and 43.2% lower latency than prior decentralized systems by optimizing communication across multiple dimensions in low-bandwidth internet settings.
- Misleading Microbenchmarks on the Java Virtual Machines
  Microbenchmarks on the JVM can produce misleading results due to unrealistic profiles collected during isolated execution despite following JMH guidelines.
- JEDI: Java Evaluation of Declarative and Imperative Queries
  JEDI is a generated benchmark suite converting SQL queries into Java Stream and imperative implementations to evaluate performance and identify efficient parallelization strategies.