Helix: Serving large language models over heterogeneous GPUs and network via max-flow

· 2025 · arXiv 9940.370721

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

baseline: 2 · background: 1

citation-polarity summary

baseline: 2 · background: 1

representative citing papers

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

cs.DC · 2026-04-11 · unverdicted · novelty 8.0

Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

Partitioning Unstructured Sparse Tensor Algebra for Load-Balanced Parallel Execution

cs.PL · 2026-04-19 · unverdicted · novelty 7.0

The paper introduces a partitioning algorithm that provably load-balances arbitrary sparse tensor algebra expressions by generalizing parallel merging to multi-operand, multi-dimensional hierarchical structures; it is implemented in a compiler framework.

Feedback-Driven Execution for LLM-Based Binary Analysis

cs.CR · 2026-04-16 · unverdicted · novelty 7.0

FORGE uses a reasoning-action-observation loop and Dynamic Forest of Agents to perform scalable LLM-based binary analysis, finding 1,274 vulnerabilities across 591 of 3,457 real-world firmware binaries at 72.3% precision and broader coverage than prior methods.

HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments

cs.DC · 2025-12-13 · unverdicted · novelty 7.0

HetRL delivers up to 9.17x higher throughput for LLM RL training on heterogeneous GPUs by using hybrid and ILP-based schedulers to solve a joint optimization problem over computation and data dependencies.

PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

cs.AI · 2026-05-20 · unverdicted · novelty 6.0

PALS adds dynamic GPU power capping to LLM serving frameworks like vLLM, jointly tuning it with batch size via offline models and feedback control to improve energy efficiency up to 26.3% and cut QoS violations 4-7x on dense and MoE models.

ShuntServe: Cost-Efficient LLM Serving on Heterogeneous Spot GPU Clusters

cs.DC · 2026-06-17 · unverdicted · novelty 5.0

ShuntServe reports 1.42x and 1.35x higher throughput than baselines, plus 31.9% and 31.2% cost-efficiency gains over on-demand instances, for Llama-3.1-70B and Qwen3-32B respectively on heterogeneous AWS spot clusters.

NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference

cs.PF · 2026-06-02 · unverdicted · novelty 5.0

NetKV is a network-aware O(|D|) greedy scheduler for decode instance selection that reduces mean TTFT by up to 21.2% versus round-robin and 17.6% versus cache+load baselines in 64-GPU fat-tree simulations.

Distributed Generative Inference of LLM at Internet Scales with Multi-Dimensional Communication Optimization

cs.DC · 2026-04-22 · unverdicted · novelty 5.0

BloomBee is a distributed LLM inference system that achieves up to 1.76x higher throughput and 43.2% lower latency than prior decentralized systems by optimizing communication across multiple dimensions in low-bandwidth internet settings.

Misleading Microbenchmarks on the Java Virtual Machines

cs.PL · 2026-05-22 · unverdicted · novelty 4.0

Microbenchmarks on the JVM can produce misleading results due to unrealistic profiles collected during isolated execution despite following JMH guidelines.

JEDI: Java Evaluation of Declarative and Imperative Queries

cs.PL · 2026-05-22 · unverdicted · novelty 4.0

JEDI is a generated benchmark suite converting SQL queries into Java Stream and imperative implementations to evaluate performance and identify efficient parallelization strategies.

citing papers explorer

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation · cs.DC · 2026-04-11 · unverdicted · none · ref 55
Partitioning Unstructured Sparse Tensor Algebra for Load-Balanced Parallel Execution · cs.PL · 2026-04-19 · unverdicted · none · ref 34
Feedback-Driven Execution for LLM-Based Binary Analysis · cs.CR · 2026-04-16 · unverdicted · none · ref 26
HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments · cs.DC · 2025-12-13 · unverdicted · none · ref 27
PALS: Power-Aware LLM Serving for Mixture-of-Experts Models · cs.AI · 2026-05-20 · unverdicted · none · ref 22
ShuntServe: Cost-Efficient LLM Serving on Heterogeneous Spot GPU Clusters · cs.DC · 2026-06-17 · unverdicted · none · ref 20
NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference · cs.PF · 2026-06-02 · unverdicted · none · ref 19
Distributed Generative Inference of LLM at Internet Scales with Multi-Dimensional Communication Optimization · cs.DC · 2026-04-22 · unverdicted · none · ref 25
Misleading Microbenchmarks on the Java Virtual Machines · cs.PL · 2026-05-22 · unverdicted · none · ref 3
JEDI: Java Evaluation of Declarative and Imperative Queries · cs.PL · 2026-05-22 · unverdicted · none · ref 4
