pith · machine review for the scientific record

arxiv: 2604.10180 · v1 · submitted 2026-04-11 · 💻 cs.DC · cs.LG

Recognition: unknown

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

Chenxi Wang, Chunwei Xia, Huimin Cui, Jin Qin, Junhao Hu, Lei Chen, Mingxing Zhang, Tao Xie, Tiancheng Hu, Ting Cao, Yizhou Shan, Yuzheng Wang, Zheng Wang

Pith reviewed 2026-05-10 15:47 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords GPU disaggregation · heterogeneous computing · kernel-level scheduling · large model inference · AI serving · PTX analysis · pipelined execution

The pith

Tessera disaggregates large-model inference at the kernel level to match heterogeneous GPU strengths more effectively than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that coarse-grained disaggregation of AI workloads leaves performance and cost gains on the table when GPUs of different types are available in a cluster. Tessera instead treats individual kernels as the unit of assignment because they display distinct compute, memory, and bandwidth needs within the same model. It extracts exact producer-consumer relations from PTX code offline, pipelines communication behind computation, and adjusts the schedule at runtime to the current workload mix. The resulting system runs on four model families and five GPU types at up to 16-GPU scale, delivering higher throughput and lower cost while supporting architectures that earlier disaggregation schemes cannot handle. A sympathetic reader cares because modern GPU fleets are already heterogeneous; finer-grained software mapping could turn that diversity into an advantage rather than a scheduling headache.
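
To make the kernel-granularity idea concrete, the sketch below shows one greedy way a scheduler could place kernels of a dependency DAG onto GPU types, trading per-kernel speed against transfer penalties. It is a minimal illustration, not Tessera's algorithm: the kernel names, timings, transfer cost, and greedy policy are all invented for exposition.

```python
# A minimal sketch, not Tessera's implementation: greedy kernel-to-GPU-type
# placement over a producer-consumer DAG. Kernel names, timings, and the
# transfer cost are invented round numbers for illustration.
from graphlib import TopologicalSorter

# Per-kernel runtime in ms, measured offline per GPU type (hypothetical).
costs = {
    "qkv_gemm":  {"A100": 1.0, "L40S": 1.6},   # compute-bound: favors A100
    "attention": {"A100": 0.8, "L40S": 1.1},
    "mlp_gemm":  {"A100": 1.2, "L40S": 2.0},
    "layernorm": {"A100": 0.4, "L40S": 0.2},   # memory-light: cheap GPU wins
    "sampling":  {"A100": 0.3, "L40S": 0.1},
}
# consumer -> set of producers (in the paper, extracted offline from PTX)
deps = {
    "attention": {"qkv_gemm"},
    "mlp_gemm": {"attention"},
    "layernorm": {"mlp_gemm"},
    "sampling": {"layernorm"},
}
TRANSFER_MS = 0.15  # modeled cost of shipping an activation across GPUs

def assign(costs, deps, transfer_ms):
    """Place each kernel where compute time plus inherited transfer cost is
    lowest, visiting kernels in dependency order so producers go first."""
    placement = {}
    for k in TopologicalSorter(deps).static_order():
        def total(gpu):
            # pay a transfer penalty for every producer placed elsewhere
            moves = sum(1 for p in deps.get(k, ()) if placement[p] != gpu)
            return costs[k][gpu] + moves * transfer_ms
        placement[k] = min(costs[k], key=total)
    return placement

print(assign(costs, deps, TRANSFER_MS))
```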

Core claim

Tessera performs the first kernel-granularity disaggregation for large-model inference across heterogeneous GPUs. Offline PTX analysis yields precise inter-kernel dependencies that guarantee correctness; a pipelined execution model overlaps data movement with kernel execution; and a workload-aware scheduler with lightweight adaptation chooses which GPU type runs each kernel at each step. Across evaluated configurations the approach improves serving throughput by up to 2.3 times and cost efficiency by up to 1.6 times relative to existing disaggregation baselines, generalizes to model families unsupported by prior work, and allows certain heterogeneous GPU pairs to surpass the throughput of a homogeneous pair of high-end GPUs at lower cost.

What carries the argument

Kernel-granularity disaggregation driven by PTX-derived dependency graphs, pipelined cross-GPU execution, and adaptive workload scheduling.
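
The hazard analysis such a PTX-derived graph encodes can be sketched in a few lines. A real implementation must trace ld.global/st.global addresses in PTX back to kernel parameters; the toy below skips that step and takes each kernel's read and write sets as given, with hypothetical buffer names.

```python
# Illustrative sketch only: inter-kernel hazard edges from per-kernel
# read/write sets. In the paper these sets come from offline PTX analysis;
# here they are supplied directly, and the buffer names are hypothetical.
from itertools import combinations

launches = [  # (kernel, reads, writes), in launch order
    ("qkv_gemm",  {"x", "w_qkv"},        {"qkv"}),
    ("attention", {"qkv"},               {"attn_out"}),
    ("mlp_gemm",  {"attn_out", "w_mlp"}, {"h"}),
    ("layernorm", {"h"},                 {"h"}),  # in-place update
]

def hazard_edges(launches):
    """Return (earlier, later, kind) edges. Any edge means the pair must stay
    ordered, and implies a cross-GPU transfer if placed on different devices."""
    edges = []
    for (a, ra, wa), (b, rb, wb) in combinations(launches, 2):
        if wa & rb: edges.append((a, b, "RAW"))  # read-after-write
        if ra & wb: edges.append((a, b, "WAR"))  # write-after-read
        if wa & wb: edges.append((a, b, "WAW"))  # write-after-write
    return edges

for edge in hazard_edges(launches):
    print(edge)
```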

If this is right

  • Certain mixed-GPU pairs under Tessera exceed the throughput of two identical high-end GPUs while using cheaper hardware.
  • The technique applies to model architectures that previous disaggregation systems cannot support.
  • Serving throughput rises by up to 2.3 times and cost efficiency by up to 1.6 times versus existing disaggregation methods.
  • The same mechanisms scale to clusters containing up to 16 GPUs drawn from five distinct GPU models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Data-center operators could deliberately purchase more varied GPU mixes if software reliably maps kernels across them, potentially lowering total acquisition cost.
  • The PTX-analysis step suggests analogous low-level dependency extraction could be applied to other accelerator instruction sets for portable heterogeneous scheduling.
  • Training workloads might benefit from the same kernel-level split if dependency graphs can be maintained across gradient and optimizer steps.

Load-bearing premise

Kernels inside one application have sufficiently varied resource demands that splitting at this level aligns work with different GPU types, and PTX analysis can recover the necessary dependencies accurately enough for safe pipelining without prohibitive overhead.
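
The first half of this premise can be sanity-checked with roofline arithmetic: a kernel's arithmetic intensity determines how much of each GPU's peak it can attain, so the predicted slowdown on a cheaper GPU varies per kernel (the spirit of the paper's Figure 3). The sketch below uses illustrative round figures, not vendor specifications.

```python
# Hedged sketch: roofline arithmetic predicting how per-kernel slowdown on a
# cheaper GPU varies with arithmetic intensity. Hardware numbers are
# illustrative round figures, not vendor specs; kernel intensities are made up.
GPUS = {
    "A100": {"peak_tflops": 312.0, "bw_tbps": 2.00},
    "L40S": {"peak_tflops": 362.0, "bw_tbps": 0.86},
}

def attained_tflops(gpu, intensity):
    """Classic roofline: attainable rate is capped by peak compute or by
    memory bandwidth times arithmetic intensity (FLOPs per byte)."""
    g = GPUS[gpu]
    return min(g["peak_tflops"], g["bw_tbps"] * intensity)

kernels = {"gemm": 300.0, "attention": 40.0, "layernorm": 2.0}  # FLOPs/byte
for name, ai in kernels.items():
    # duration ratio L40S/A100 = rate ratio A100/L40S
    slowdown = attained_tflops("A100", ai) / attained_tflops("L40S", ai)
    print(f"{name:10s} intensity={ai:6.1f}  predicted L40S/A100 ratio={slowdown:.2f}x")
```

With these numbers the predicted ratio spans roughly 1.2x to 2.3x across kernels, which is exactly the per-kernel heterogeneity the premise requires.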

What would settle it

Measure end-to-end serving throughput and cost on the same heterogeneous GPU pairs when the system is forced to fall back to model-level or layer-level assignment; if the gains disappear or reverse, the kernel-granularity premise does not hold.
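
The decision rule for such an ablation is plain arithmetic; a sketch with placeholder numbers standing in for real measurements:

```python
# Placeholder-driven sketch of the falsification test: same hardware, three
# assignment granularities. The throughput and price figures below are
# placeholders, not measurements from the paper.
runs = {  # granularity -> (tokens per second, dollars per hour)
    "model-level":  (1000.0, 6.0),
    "layer-level":  (1400.0, 6.0),
    "kernel-level": (2300.0, 6.0),
}

def premise_holds(runs, baseline, margin=1.05):
    """Kernel-granularity premise survives only if kernel-level assignment
    beats the coarser baseline on both throughput and tokens per dollar."""
    b_tps, b_cost = runs[baseline]
    k_tps, k_cost = runs["kernel-level"]
    speedup = k_tps / b_tps
    cost_eff = (k_tps / k_cost) / (b_tps / b_cost)
    return speedup, cost_eff, speedup > margin and cost_eff > margin

for base in ("model-level", "layer-level"):
    s, c, ok = premise_holds(runs, base)
    print(f"vs {base}: {s:.2f}x throughput, {c:.2f}x cost-eff -> holds: {ok}")
```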

Figures

Figures reproduced from arXiv: 2604.10180 by Chenxi Wang, Chunwei Xia, Huimin Cui, Jin Qin, Junhao Hu, Lei Chen, Mingxing Zhang, Tao Xie, Tiancheng Hu, Ting Cao, Yizhou Shan, Yuzheng Wang, Zheng Wang.

Figure 1. Simplified example of kernel disaggregation.
Figure 2. Kernel heterogeneity between A100 and L40S. (a) CDF …
Figure 3. Kernel duration ratio (L40S / A100) for GPT-oss 20B.
Figure 4. Design overview of Tessera.
Figure 5. An execution example of the GPU worker …
Figure 6. Offline throughput across five workloads and three heterogeneous GPU pairs. Left y-axis: tokens/s for LLM, MLLM, …
Figure 8. Cluster-scale offline throughput. (a) GPT-oss 20B on …
Figure 10. Sensitivity of normalized latency and policy switch …
Figure 11. (a) Impact of network bandwidth on GPT-oss 20B …
Original abstract

Disaggregation maps parts of an AI workload to different types of GPUs, offering a path to utilize modern heterogeneous GPU clusters. However, existing solutions operate at a coarse granularity and are tightly coupled to specific model architectures, leaving much room for performance improvement. This paper presents Tessera, the first kernel disaggregation system to improve performance and cost efficiency on heterogeneous GPUs for large model inference. Our key insight is that kernels within a single application exhibit diverse resource demands, making them the most suitable granularity for aligning computation with hardware capabilities. Tessera integrates offline analysis with online adaptation by extracting precise inter-kernel dependencies from PTX to ensure correctness, overlapping communication with computation through a pipelined execution model, and employing workload-aware scheduling with lightweight runtime adaptation. Extensive evaluations across five heterogeneous GPUs and four model architectures, scaling up to 16 GPUs, show that Tessera improves serving throughput and cost efficiency by up to 2.3x and 1.6x, respectively, compared to existing disaggregation methods, while generalizing to model architectures where prior approaches do not apply. Surprisingly, a heterogeneous GPU pair under Tessera can even exceed the throughput of two homogeneous high-end GPUs at a lower cost.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. Tessera is a kernel-granularity disaggregation system for large-model inference on heterogeneous GPU clusters. It performs offline PTX analysis to extract inter-kernel dependencies, employs a pipelined execution model that overlaps communication with computation, and uses workload-aware scheduling with lightweight runtime adaptation. The system is evaluated on up to 16 GPUs across five heterogeneous GPU types and four model architectures, reporting up to 2.3× serving throughput and 1.6× cost-efficiency gains over existing disaggregation methods while generalizing to architectures unsupported by prior work; a heterogeneous GPU pair can even exceed the throughput of two homogeneous high-end GPUs at lower cost.

Significance. If the PTX-based dependency model and pipelined scheduling deliver the claimed correctness and overlap, Tessera would meaningfully advance utilization of heterogeneous GPU hardware for inference workloads, with broad applicability beyond architecture-specific prior art. The scale of the evaluation (multiple GPU types, model families, and cluster sizes) and the counter-intuitive result that heterogeneous pairs can outperform homogeneous high-end pairs constitute concrete strengths that would be valuable to the systems community.

major comments (2)
  1. [§4.2] PTX Dependency Extraction: The central correctness and performance argument rests on the claim that static PTX analysis produces a precise inter-kernel dependency graph sufficient for safe pipelined disaggregation. The manuscript does not describe how the analysis handles dynamic allocations, CUDA runtime decisions, or calls into vendor libraries (cuBLAS, cuDNN), which are common sources of hidden dependencies. If the graph is incomplete, the runtime must either insert extra synchronizations (reducing overlap and undermining the 2.3× throughput numbers) or risk incorrect execution.
  2. [§5] Evaluation Methodology: The reported 2.3× throughput and 1.6× cost gains are load-bearing for the paper’s contribution, yet the evaluation section provides insufficient detail on baseline implementations, whether they were re-tuned for the heterogeneous setting, and the presence of error bars or statistical significance across runs. Without these, it is impossible to determine whether the gains are attributable to kernel-granularity disaggregation or to differences in experimental setup.
minor comments (2)
  1. [Abstract] The abstract states that Tessera “ensures correctness,” but the manuscript should explicitly list the assumptions under which the PTX analysis is sound (e.g., absence of certain CUDA features) so readers can assess the scope of the guarantee.
  2. [Figures 6–9] Figure captions and axis labels in the performance plots should include the exact GPU models and batch sizes used for each data point to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness.

Point-by-point responses
  1. Referee: [§4.2] PTX Dependency Extraction: The central correctness and performance argument rests on the claim that static PTX analysis produces a precise inter-kernel dependency graph sufficient for safe pipelined disaggregation. The manuscript does not describe how the analysis handles dynamic allocations, CUDA runtime decisions, or calls into vendor libraries (cuBLAS, cuDNN), which are common sources of hidden dependencies. If the graph is incomplete, the runtime must either insert extra synchronizations (reducing overlap and undermining the 2.3× throughput numbers) or risk incorrect execution.

    Authors: We agree that §4.2 would benefit from explicit discussion of these cases to strengthen the correctness argument. The offline PTX analysis constructs the dependency graph from kernel launches, memory operations, and control flow, with conservative edges added for uncertain cases. We will revise the section to describe how dynamic allocations are handled via over-approximation of subsequent accesses, how CUDA runtime decisions are approximated statically, and how vendor library calls (cuBLAS, cuDNN) are modeled as atomic units using their API tensor signatures. The runtime adaptation layer includes lightweight verification to insert minimal barriers only when needed. These additions will clarify that the reported overlap and throughput gains remain valid under the extended handling (a sketch of this conservative-edge fallback appears below, after the responses). revision: yes

  2. Referee: [§5] Evaluation Methodology: The reported 2.3× throughput and 1.6× cost gains are load-bearing for the paper’s contribution, yet the evaluation section provides insufficient detail on baseline implementations, whether they were re-tuned for the heterogeneous setting, and the presence of error bars or statistical significance across runs. Without these, it is impossible to determine whether the gains are attributable to kernel-granularity disaggregation or to differences in experimental setup.

    Authors: We concur that more methodological transparency is required. We will expand §5 with a dedicated subsection detailing each baseline implementation, including how prior disaggregation methods were re-implemented and tuned specifically for heterogeneous GPU configurations (e.g., adjusting placement and communication parameters). We will also report all throughput and cost results with error bars derived from multiple independent runs and include statistical significance testing to confirm that the observed improvements are attributable to Tessera’s kernel-granularity approach rather than setup variations. revision: yes
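
The over-approximation strategy in the first response can be made concrete: whenever the analysis cannot resolve a kernel's memory footprint (dynamic allocation, an opaque cuBLAS or cuDNN call), it falls back to a conservative ordering edge, trading overlap for safety. A hypothetical sketch, with invented kernel names:

```python
# Hedged sketch of the over-approximation described in the response: when
# static analysis cannot prove two kernels' buffer sets are disjoint, add a
# conservative edge. Kernel names and the UNKNOWN marker are invented.
UNKNOWN = "?"  # stands for a memory footprint the analysis could not resolve

launches = [  # (kernel, buffers touched, or UNKNOWN for opaque calls)
    ("embed",       {"x", "emb"}),
    ("cublas_gemm", UNKNOWN),      # vendor library: footprint unresolved
    ("softmax",     {"logits"}),
]

def edges_with_fallback(launches):
    """Precise edge only when both footprints are known and overlap;
    conservative edge (forcing a sync barrier) whenever either is UNKNOWN."""
    precise, conservative = [], []
    for i, (a, fa) in enumerate(launches):
        for b, fb in launches[i + 1:]:
            if fa is UNKNOWN or fb is UNKNOWN:
                conservative.append((a, b))
            elif fa & fb:
                precise.append((a, b))
    return precise, conservative

p, c = edges_with_fallback(launches)
print("precise edges:", p)
print("conservative edges (extra syncs, lost overlap):", c)
```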

Circularity Check

0 steps flagged

No circularity: empirical systems implementation and evaluation

full rationale

This is an empirical systems paper describing an implemented kernel-disaggregation runtime with offline PTX analysis, pipelined execution, and workload-aware scheduling. All performance claims (2.3x throughput, 1.6x cost efficiency) are grounded in direct measurements across five GPU types, four model architectures, and up to 16 GPUs. No equations, fitted parameters, uniqueness theorems, or self-citations are used to derive results; the central argument is that the described mechanisms produce the observed speedups, which are independently verifiable by re-running the experiments. No step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about kernel resource diversity and PTX extractability rather than new mathematical axioms, free parameters, or invented entities. No free parameters or new postulated entities are introduced.

axioms (2)
  • domain assumption Kernels within a single application exhibit diverse resource demands, making them the most suitable granularity for aligning computation with hardware capabilities
    Stated as the key insight in the abstract.
  • domain assumption Precise inter-kernel dependencies can be extracted from PTX to ensure correctness
    Required for the offline analysis component described in the abstract.

pith-pipeline@v0.9.0 · 5547 in / 1488 out tokens · 46623 ms · 2026-05-10T15:47:53.269100+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 26 canonical work pages · 8 internal anchors

  1. [1]

    Ppipe: efficient video analytics serving on heterogeneous GPU clusters via pool-based pipeline parallelism,

Z. J. Kong, Q. Xu, and Y. C. Hu, “Ppipe: efficient video analytics serving on heterogeneous GPU clusters via pool-based pipeline parallelism,” in Proceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC 25). [Online]. Available: https://www.usenix.org/conference/atc25/presentation/kong

  2. [2]

    Rearchitecting Datacenter Lifecycle for AI: A TCO-Driven Framework,

J. Stojkovic, C. Zhang, Í. Goiri, and R. Bianchini, “Rearchitecting Datacenter Lifecycle for AI: A TCO-Driven Framework,” arXiv preprint arXiv:2509.26534, 2025. [Online]. Available: https://arxiv.org/abs/2509.26534

  3. [3]

    Compute Engine,

    “Compute Engine,” https://cloud.google.com/compute/docs/gpus, 2025

  4. [4]

    How Amazon Achieved Rocketing Sales and Growth From AI,

K. Wheeler, “How Amazon Achieved Rocketing Sales and Growth From AI,” https://aimagazine.com/news/how-amazon-achieved-rocketing-sales-and-growth-from-ai, 2025

  5. [5]

    60+ ChatGPT Statistics And Facts You Need to Know in 2024,

    L. Marina, “60+ ChatGPT Statistics And Facts You Need to Know in 2024,” https://invgate.com/blog/chatgpt-statistics/, 2024

  6. [6]

The cost of dynamic reasoning: Demystifying ai agents and test-time scaling from an ai infrastructure perspective,

    J. Kim, B. Shin, J. Chung, and M. Rhu, “The cost of dynamic reasoning: Demystifying ai agents and test-time scaling from an ai infrastructure perspective.” [Online]. Available: https://arxiv.org/abs/2506.04301

  7. [7]

Autellix: An efficient serving engine for llm agents as general programs,

    M. Luo, X. Shi, C. Cai, T. Zhang, J. Wong, Y. Wang, C. Wang, Y. Huang, Z. Chen, J. E. Gonzalez, and I. Stoica, “Autellix: An efficient serving engine for llm agents as general programs,” arXiv preprint arXiv:2502.13965. [Online]. Available: https://arxiv.org/abs/2502.13965

  8. [8]

    Cost-efficient large language model serving for multi-turn conversations with cachedattention,

B. Gao, Z. He, P. Sharma, Q. Kang, D. Jevdjic, J. Deng, X. Yang, Z. Yu, and P. Zuo, “Cost-efficient large language model serving for multi-turn conversations with cachedattention,” in Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference (ATC 24), 2024. [Online]. Available: https://www.usenix.org/conference/atc24/presentation/ga...

  9. [9]

    Splitwise: Efficient generative llm inference using phase splitting

P. Patel, E. Choukse, C. Zhang, A. Shah, Í. Goiri, S. Maleki, and R. Bianchini, “Splitwise: Efficient generative llm inference using phase splitting,” in Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA 24). [Online]. Available: https://doi.org/10.1109/ISCA59077.2024.00019

  10. [10]

    DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving,

Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang, “DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving,” in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). [Online]. Available: https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin

  11. [11]

    Cronus: Efficient LLM inference on Heterogeneous GPU Clusters via Partially Disaggregated Prefill,

Y. Liu, Q. Xu, and Y. C. Hu, “Cronus: Efficient LLM inference on Heterogeneous GPU Clusters via Partially Disaggregated Prefill,” arXiv preprint arXiv:2509.17357, 2025. [Online]. Available: https://arxiv.org/abs/2509.17357

  12. [12]

    Megascale- Infer: Efficient Mixture-of-Experts Model Serving with Disaggregated Expert Parallelism,

R. Zhu, Z. Jiang, C. Jin, P. Wu, C. A. Stuardo, D. Wang, X. Zhang, H. Zhou, H. Wei, Y. Cheng, J. Xiao, X. Zhang, L. Liu, H. Lin, L.-W. Chang, J. Ye, X. Yu, X. Liu, X. Jin, and X. Liu, “Megascale-Infer: Efficient Mixture-of-Experts Model Serving with Disaggregated Expert Parallelism,” in Proceedings of the 39th ACM Special Interest Group on Data Communica...

  13. [13]

Step-3 is large yet affordable: Model-system co-design for cost-effective decoding,

    B. Wang, B. Wang, C. Wan, G. Huang, H. Hu et al., “Step-3 is large yet affordable: Model-system co-design for cost-effective decoding,” arXiv preprint arXiv:2507.19427, 2025. [Online]. Available: https://arxiv.org/abs/2507.19427

  14. [15]

Parallel Thread Execution ISA, version 9.0 (PTX),

    NVIDIA, “Parallel Thread Execution ISA, version 9.0 (PTX),” https://docs.nvidia.com/cuda/parallel-thread-execution/index.html, 2025

  15. [16]

    Efficient memory management for large language model serving with pagedattention

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the 29th Symposium on Operating Systems Principles (SOSP 23), 2023. [Online]. Available: https://doi.org/10.1145/3600006.3613165

  16. [17]

    PyTorch,

    “PyTorch,” https://github.com/pytorch/pytorch, 2026

  17. [18]

    Google cloud. virtual machines pricing,

“Google cloud. virtual machines pricing,” https://cloud.google.com/products/compute/pricing/accelerator-optimized, 2026

  18. [19]

    NVIDIA MGX,

“NVIDIA MGX,” https://www.nvidia.com/en-us/data-center/products/mgx/, 2026

  19. [20]

    Supermicro MGX Systems,

“Supermicro MGX Systems,” https://www.supermicro.com/zh_cn/accelerators/nvidia/mgx, 2026

  20. [21]

    GIGABYTE MGX Server,

“GIGABYTE MGX Server,” https://www.gigabyte.com/Enterprise/MGX-Server, 2026

  21. [22]

    Open sycl on heterogeneous gpu systems: A case of study,

R. Carratalá-Sáez, F. J. Andújar, Y. Torres, A. Gonzalez-Escribano, and D. R. Llanos, “Open sycl on heterogeneous gpu systems: A case of study,” arXiv preprint arXiv:2310.06947, 2023. [Online]. Available: https://arxiv.org/abs/2310.06947

  22. [23]

Beyond hardware: Building the software foundation for heterogeneous gpu,

    J. Liu, “Beyond hardware: Building the software foundation for heterogeneous gpu,” Blog post, 2025, https://seclee.com/post/202510hmc/

  23. [24]

    Basic Linear Algebra on NVIDIA Gpus,

“Basic Linear Algebra on NVIDIA Gpus,” https://developer.nvidia.com/cublas, 2025

  24. [25]

    Roofline: an insightful visual performance model for multicore architectures,

S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Commun. ACM,

  25. [26]
  26. [27]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” arXiv preprint arXiv:2205.14135, 2022. [Online]. Available: https://arxiv.org/abs/2205.14135

  27. [28]

Phoenixos: Concurrent OS-level gpu checkpoint and restore with validated speculation

    W. Xingda, H. Zhuobin, S. Tianle, H. Yingyi, C. Rong, H. Mingcong, G. Jinyu, and C. Haibo, “Phoenixos: Concurrent OS-level gpu checkpoint and restore with validated speculation,” in Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP 25). [Online]. Available: https://doi.org/10.1145/3731569.3764813

  28. [29]

    PyTorch THCCachingAllocator,

“PyTorch THCCachingAllocator,” https://github.com/torch/cutorch/blob/master/lib/THC/THCCachingAllocator.cpp, 2026

  29. [30]

    Neutrino: fine-grained gpu kernel profiling via programmable probing,

S. Huang and C. Wu, “Neutrino: fine-grained gpu kernel profiling via programmable probing,” in Proceedings of the 19th USENIX Conference on Operating Systems Design and Implementation (OSDI 25). [Online]. Available: https://www.usenix.org/conference/osdi25/presentation/huang-songlin

  30. [31]

    Extending applications safely and efficiently,

Y. Zheng, T. Yu, Y. Yang, Y. Hu, X. Lai, D. Williams, and A. Quinn, “Extending applications safely and efficiently,” in Proceedings of the 19th USENIX Conference on Operating Systems Design and Implementation (OSDI 25). [Online]. Available: https://www.usenix.org/conference/osdi25/presentation/zheng-yusheng

  31. [32]

    Attention Is All You Need

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2023. [Online]. Available: https://arxiv.org/abs/1706.03762

  32. [33]

    NVIDIA tensorRT,

    NVIDIA, “NVIDIA tensorRT,” https://developer.nvidia.com/tensorrt, 2025

  33. [34]

    Graph Management,

“Graph Management,” https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html, 2025

  34. [35]

    CUDA Graph Best Practice for PyTorch,

“CUDA Graph Best Practice for PyTorch,” https://docs.nvidia.com/dl-cuda-graph/latest/torch-cuda-graph/handling-dynamic-patterns.html, 2026

  35. [36]

G. L. Nemhauser and L. A. Wolsey, Integer and Combinatorial Optimization. Wiley, New York, 1988

  36. [37]

    Pipedream: generalized pipeline parallelism for dnn training,

D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia, “Pipedream: generalized pipeline parallelism for dnn training,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP 19). [Online]. Available: https://doi.org/10.1145/3341301.3359646

  37. [38]

    Gurobi Optimizer,

    “Gurobi Optimizer,” https://www.gurobi.com/, 2026

  38. [39]

    Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect,

“Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect,” https://developer.nvidia.com/blog/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async/, 2022

  39. [40]

    “NCCL,” https://github.com/NVIDIA/nccl, 2026

  40. [41]

    CUDA Runtime API:GPU Stream Management,

“CUDA Runtime API: GPU Stream Management,” https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM, 2022

  41. [42]

    NVIDIA PTXAS,

“NVIDIA PTXAS,” https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/, 2024

  42. [43]

    Rodinia: A benchmark suite for heterogeneous computing

    S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing.” [Online]. Available: https://doi.org/10.1109/IISWC.2009.5306797

  43. [44]

    CUDA Programming Guide CUDA Graphs,

“CUDA Programming Guide CUDA Graphs,” https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/cuda-graphs.html, 2025

  44. [45]

    NCCL Communicator Quality of Service,

“NCCL Communicator Quality of Service,” https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html, 2026

  45. [46]

    Meta Llama 3,

    “Meta Llama 3,” https://llama.meta.com/llama3, 2024

  46. [47]

    gpt-oss-120b & gpt-oss-20b Model Card

S. Agarwal, L. Ahmad, J. Ai, S. Altman, et al., “gpt-oss-120b and gpt-oss-20b Model Card,” 2025. [Online]. Available: https://arxiv.org/abs/2508.10925

  47. [48]

    Mamba-Codestral-7B-v0.1,

“Mamba-Codestral-7B-v0.1,” https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1, 2025

  48. [49]

    Qwen2.5-VL,

Qwen Team, “Qwen2.5-VL,” https://qwenlm.github.io/blog/qwen2.5-vl/, January 2025

  49. [50]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y. Marek, and R. Rombach, “Scaling rectified flow transformers for high-resolution image synthesis,” 2024. [Online]. Available: https://arxiv.org/abs/2403.03206

  50. [51]

    Stable diffusion 3.5 medium,

Stability AI, “Stable diffusion 3.5 medium,” https://huggingface.co/stabilityai/stable-diffusion-3.5-medium, 2024

  51. [52]

    Microsoft COCO Captions: Data Collection and Evaluation Server

X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick, “Microsoft coco captions: Data collection and evaluation server,” 2015. [Online]. Available: https://arxiv.org/abs/1504.00325

  52. [53]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan, B. Hutchinson, W. Han, Z. Parekh, X. Li, H. Zhang, J. Baldridge, and Y. Wu, “Scaling autoregressive models for content-rich text-to-image generation,” 2022. [Online]. Available: https://arxiv.org/abs/2206.10789

  53. [54]

    Nanoflow: towards optimal large language model serving throughput,

K. Zhu, Y. Gao, Y. Zhao, L. Zhao, G. Zuo, Y. Gu, D. Xie, T. Tang, Q. Xu, Z. Ye, K. Kamahori, C.-Y. Lin, Z. Wang, S. Wang, A. Krishnamurthy, and B. Kasikci, “Nanoflow: towards optimal large language model serving throughput,” in Proceedings of the 19th USENIX Conference on Operating Systems Design and Implementation (OSDI 25). [Online]. Available: https...

  54. [55]

    Helix: Serving large language models over heterogeneous gpus and network via max-flow,

Y. Mei, Y. Zhuang, X. Miao, J. Yang, Z. Jia, and R. Vinayak, “Helix: Serving large language models over heterogeneous gpus and network via max-flow,” in Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 25). [Online]. Available: https://doi.org/10.1145/3669940.3707215

  55. [56]

Hetis: Serving llms in heterogeneous gpu clusters with fine-grained and dynamic parallelism,

    Z. Mo, J. Liao, H. Xu, Z. Zhou, and C. Xu, “Hetis: Serving llms in heterogeneous gpu clusters with fine-grained and dynamic parallelism,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 25). [Online]. Available: https://doi.org/10.1145/3712285.3759784

  56. [57]

    The impact of gpu dvfs on the energy and performance of deep learning: an empirical study,

Z. Tang, Y. Wang, Q. Wang, and X. Chu, “The impact of gpu dvfs on the energy and performance of deep learning: an empirical study,” in Proceedings of the Tenth ACM International Conference on Future Energy Systems, 2019. [Online]. Available: https://doi.org/10.1145/3307772.3328315

  57. [58]

    Qwen3 Technical Report

A. Yang et al., “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

  58. [59]

    DeepSeek-V3 Technical Report

DeepSeek-AI et al., “Deepseek-v3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2412.19437

  59. [60]

    Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment,

Y. Jiang, R. Yan, and B. Yuan, “Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment,” 2025. [Online]. Available: https://arxiv.org/abs/2502.07903

  60. [61]

    Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters,

F. Strati, Z. Zhang, G. Manos, I. S. Périz, Q. Hu, T. Chen, B. Buzcu, S. Han, P. Delgado, and A. Klimovic, “Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters,” in Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP 25). [Online]. Available: https://dl.acm.org/doi/10.1145/37315...

  61. [62]

    Metis: Fast Automatic Distributed Training on Heterogeneous GPUs,

T. Um, B. Oh, M. Kang, W.-Y. Lee, G. Kim, D. Kim, Y. Kim, M. Muzzammil, and M. Jeon, “Metis: Fast Automatic Distributed Training on Heterogeneous GPUs,” in 2024 USENIX Annual Technical Conference (USENIX ATC 24). [Online]. Available: https://www.usenix.org/conference/atc24/presentation/um