pith. machine review for the scientific record.

arxiv: 2605.02953 · v1 · submitted 2026-05-02 · 💻 cs.PL

Recognition: unknown

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:54 UTC · model grok-4.3

classification 💻 cs.PL
keywords tensor compiler · distributed computing · parallel tensor programs · LLM scaling · multi-level tiling · CUDA performance · heterogeneous hardware · model FLOPS utilization

The pith

DITRON's Core-Device-Task hierarchy lets a compiler generate distributed tensor kernels that match or exceed expert-tuned CUDA libraries on large clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DITRON as a tile-level compiler designed to handle the complexities of distributed tensor programs for scaling large language models. It introduces a three-level programming abstraction to map computations across cores, devices, and tasks while hiding inter-node communication details. This allows the system to support varied parallelism approaches and deliver performance comparable to or better than specialized libraries without extensive manual optimization. A reader would care because current LLM scaling is limited by rigid distributed programming, and a flexible compiler could accelerate development and reduce costs in training and inference.

Core claim

DITRON is a scalable tile-level compiler that introduces a novel hierarchical programming abstraction spanning Core, Device, and Task levels to map tensor programs efficiently onto heterogeneous distributed hardware. This abstraction supports diverse parallelism strategies while abstracting away the complexity of inter-node and intra-node communication, delivering performance on par with or exceeding expert-tuned CUDA libraries: speedups of 6%-30% on isolated kernels and 5%-30% on end-to-end inference.

What carries the argument

The hierarchical Core-Device-Task programming abstraction that maps tensor programs onto distributed hardware at multiple parallelism levels while abstracting communication.
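
To make the three levels concrete, here is a minimal, runnable sketch of multi-level tiling in plain Python with NumPy. The loop structure, partition sizes, and names are invented for illustration; DITRON's actual DSL and scheduling are not shown in the material above, and a real compiler would insert inter-device communication between the device-level blocks rather than share one array.

    # Hypothetical sketch of Core-Device-Task tiling (names and sizes invented).
    import numpy as np

    M, N, K = 256, 256, 128
    TASKS, DEVICES, CORE_TILE = 2, 2, 32   # one split per hierarchy level

    A = np.random.rand(M, K).astype(np.float32)
    B = np.random.rand(K, N).astype(np.float32)
    C = np.zeros((M, N), dtype=np.float32)

    for t in range(TASKS):                                # Task level: e.g. nodes
        rows = slice(t * M // TASKS, (t + 1) * M // TASKS)
        for d in range(DEVICES):                          # Device level: GPUs in a node
            cols = slice(d * N // DEVICES, (d + 1) * N // DEVICES)
            # Core level: tile the per-device block for cores / SMs.
            for i in range(rows.start, rows.stop, CORE_TILE):
                for j in range(cols.start, cols.stop, CORE_TILE):
                    C[i:i+CORE_TILE, j:j+CORE_TILE] = (
                        A[i:i+CORE_TILE, :] @ B[:, j:j+CORE_TILE]
                    )

    assert np.allclose(C, A @ B, rtol=1e-4, atol=1e-4)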

If this is right

  • DITRON achieves 6% to 30% speedups on isolated kernels compared to expert libraries.
  • It delivers 5% to 30% gains on end-to-end inference tasks in systems like vLLM.
  • Over 10% improvement in model FLOPS utilization (MFU) during training compounds into substantial computational savings (see the arithmetic sketch after this list).
  • The system shows portability across NVIDIA and AMD hardware platforms.
  • Enterprise deployment demonstrates practical benefits in both training and inference workloads.
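
The MFU bullet compounds in a simple way: for a fixed training workload the total FLOPs are constant, so GPU-hours scale inversely with MFU. A toy calculation with invented numbers (a 40% to 50% MFU jump on a fleet burning 2.5M GPU-hours a month) reproduces the abstract's ~500,000 GPU-hours figure; the paper does not disclose its actual baseline.

    # Illustrative arithmetic only; baseline MFU and fleet size are assumptions.
    baseline_mfu, improved_mfu = 0.40, 0.50   # assumed before/after utilization
    monthly_gpu_hours = 2_500_000             # assumed fleet usage at baseline

    saved = monthly_gpu_hours * (1 - baseline_mfu / improved_mfu)
    print(f"GPU-hours saved per month: {saved:,.0f}")   # 500,000 with these inputs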

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the abstraction proves general, it could reduce the need for low-level CUDA programming expertise in developing new distributed AI systems.
  • Extending the hierarchy might allow seamless support for emerging hardware without rewriting kernels.
  • Integration into higher-level frameworks could further automate optimization for rapidly changing model architectures.

Load-bearing premise

The hierarchical Core-Device-Task abstraction can efficiently support diverse parallelism strategies and map tensor programs onto heterogeneous distributed hardware without introducing substantial overhead or requiring significant manual expert tuning.

What would settle it

A benchmark where DITRON-generated code falls more than 10% behind hand-optimized CUDA implementations on a new large-scale LLM model architecture across multiple cluster sizes would challenge the performance claims.
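
A hedged sketch of what running that test could look like: time a candidate kernel against a cuBLAS-backed baseline and flag a shortfall beyond 10%. Since DITRON is not runnable from this page, `candidate` is a stand-in bound to torch.matmul; in a real test it would be the compiler-generated kernel.

    # Sketch of the settling benchmark; `candidate` is a stand-in for a
    # DITRON-generated kernel, and the baseline is cuBLAS via torch.matmul.
    import statistics
    import torch

    def time_kernel(fn, *args, runs=20, warmup=5):
        for _ in range(warmup):
            fn(*args)
        torch.cuda.synchronize()
        times = []
        for _ in range(runs):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record(); fn(*args); end.record()
            torch.cuda.synchronize()
            times.append(start.elapsed_time(end))   # milliseconds
        return statistics.median(times)

    a = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
    b = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)

    candidate = torch.matmul   # replace with the compiled DITRON kernel
    gap = time_kernel(candidate, a, b) / time_kernel(torch.matmul, a, b) - 1.0
    print(f"slowdown vs cuBLAS baseline: {gap:+.1%}")
    if gap > 0.10:
        print("falls >10% behind: performance claim challenged on this shape")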

read the original abstract

The scaling of large language models (LLMs) is currently bottlenecked by the rigidity of distributed programming. While high-performance libraries like CuBLAS and NCCL provide optimized primitives, they lack the flexibility required for rapidly evolving model architectures. Conversely, existing tensor compilers fail to address the complex memory hierarchy of distributed clusters effectively. To bridge this gap, we propose DITRON, a scalable tile-level compiler that democratizes high-performance distributed kernel development. DITRON introduces a novel hierarchical programming abstraction spanning Core, Device, and Task levels to map tensor programs efficiently onto heterogeneous distributed hardware. This abstraction allows DITRON to support diverse parallelism strategies while abstracting away the complexity of inter-node and intra-node communication. Evaluated across large-scale clusters, DITRON achieves performance parity with or exceeding expert-tuned CUDA libraries, delivering speedups of $6\%-30\%$ on isolated kernels and $5\%-30\%$ on end-to-end inference in vLLM. Furthermore, DITRON demonstrates strong portability, achieving significant speedups on both NVIDIA and AMD platforms. \ours{} has been deployed at the enterprise level for both training and inference. It achieves an MFU improvement of over 10\% in training tasks, saving approximately 500,000 GPU hours of training cost per month. For inference tasks, it delivers an end-to-end gain of over 20\% and has been applied to cloud service inference and edge inference scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents DITRON, a scalable tile-level compiler for parallel tensor programs on distributed heterogeneous hardware. It introduces a hierarchical Core-Device-Task programming abstraction to map tensor computations while abstracting inter-node/intra-node communication and supporting diverse parallelism strategies (data, tensor, pipeline). The central claims are performance parity or superiority to expert-tuned CUDA/NCCL libraries, with 6%-30% speedups on isolated kernels, 5%-30% on vLLM end-to-end inference, >10% MFU gains in training (saving ~500k GPU hours/month), portability to NVIDIA/AMD, and enterprise deployment for training and inference.

Significance. If the performance results and low-overhead mapping claims hold under scrutiny, the work would be significant for compilers and parallel systems in ML. It targets the rigidity of current distributed programming for rapidly evolving LLMs by offering a more flexible compiler-based alternative to hand-tuned libraries, with demonstrated practical impact via production deployment and large-scale resource savings. The cross-platform portability is a notable strength.

major comments (2)
  1. [Abstract and Evaluation] Abstract and Evaluation section: The specific quantitative claims (6%-30% kernel speedups, 5%-30% vLLM gains, >10% MFU improvement, and 500,000 GPU hours/month savings) are stated without any description of experimental methodology, chosen baselines (e.g., exact cuBLAS/NCCL versions or hand-tuned implementations), hardware configurations, number of runs, error bars, or data availability. This directly undermines verifiability of the central performance claims.
  2. [Core-Device-Task abstraction and Evaluation] The Core-Device-Task abstraction (described in the main technical sections): The paper asserts that this three-level hierarchy automatically supports diverse parallelism strategies and maps programs onto distributed hardware with low overhead and minimal expert tuning. However, the evaluation provides no ablations isolating the abstraction's contribution, no counter-examples on irregular tiling or heavy cross-device communication patterns, and no discussion of whether schedule annotations or manual fusion decisions are still required. Without these, the 6-30% speedups cannot be confidently attributed to the proposed abstraction rather than low-level codegen.
minor comments (2)
  1. [Abstract] The abstract mixes the literal name 'DITRON' with the unexpanded LaTeX macro '\ours{}'; ensure consistent naming and expansion of the system name throughout the manuscript.
  2. [Evaluation] Figures and tables in the evaluation would benefit from explicit captions detailing the exact configurations and baselines used for each speedup bar.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript on DITRON. The feedback highlights important aspects of verifiability and attribution of results, which we address below. We provide point-by-point responses to the major comments and outline planned revisions to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: The specific quantitative claims (6%-30% kernel speedups, 5%-30% vLLM gains, >10% MFU improvement, and 500,000 GPU hours/month savings) are stated without any description of experimental methodology, chosen baselines (e.g., exact cuBLAS/NCCL versions or hand-tuned implementations), hardware configurations, number of runs, error bars, or data availability. This directly undermines verifiability of the central performance claims.

    Authors: We agree that the abstract and evaluation section would benefit from greater methodological transparency to support verifiability. In the revised manuscript, we will expand the evaluation section with a dedicated 'Experimental Methodology' subsection. This will specify hardware details (e.g., cluster node counts, GPU models such as NVIDIA H100 and AMD MI250X), exact baseline versions (cuBLAS 12.4, NCCL 2.19, and descriptions of any hand-tuned references), number of runs (at least 10 per measurement with reported means and standard deviations), and inclusion of error bars in figures. We will also add a statement on data availability, committing to release benchmark scripts and aggregated results via a public repository upon acceptance. The abstract claims will remain high-level but will be explicitly cross-referenced to this expanded setup. revision: yes

  2. Referee: [Core-Device-Task abstraction and Evaluation] The Core-Device-Task abstraction (described in the main technical sections): The paper asserts that this three-level hierarchy automatically supports diverse parallelism strategies and maps programs onto distributed hardware with low overhead and minimal expert tuning. However, the evaluation provides no ablations isolating the abstraction's contribution, no counter-examples on irregular tiling or heavy cross-device communication patterns, and no discussion of whether schedule annotations or manual fusion decisions are still required. Without these, the 6-30% speedups cannot be confidently attributed to the proposed abstraction rather than low-level codegen.

    Authors: This observation is valid and points to a gap in the current evaluation. While the manuscript demonstrates overall performance and cross-platform results, it does not isolate the hierarchical abstraction's specific contributions through ablations. In the revision, we will add targeted ablations (e.g., comparing full Core-Device-Task against flattened two-level variants) and include results on irregular tiling and high-communication workloads to show where the abstraction provides benefits or encounters limits. We will also clarify the role of user-provided annotations: the core mapping, tiling decisions, and communication abstraction are automated, though optional high-level hints can guide fusion for further gains. These additions will help attribute the observed speedups more directly to the proposed abstraction. revision: yes
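
As a concrete shape for both promised revisions, a skeleton of the ablation matrix plus the ten-or-more-run mean-and-deviation reporting might look like this; the configuration names, workloads, and synthetic timing stub are all invented placeholders for real compile-and-measure calls.

    # Skeleton of the promised ablation; every name and number here is synthetic.
    import random
    import statistics

    CONFIGS = ["core-device-task", "core-device (tasks flattened)", "core only"]
    WORKLOADS = ["dense_gemm", "irregular_tiling", "allreduce_heavy_mlp"]

    def time_config(config, workload, runs=10):
        # Stand-in for compile-and-time; replace with real kernel launches.
        return [random.gauss(10.0, 0.3) for _ in range(runs)]   # synthetic ms

    for w in WORKLOADS:
        for c in CONFIGS:
            samples = time_config(c, w)
            mean, std = statistics.mean(samples), statistics.stdev(samples)
            print(f"{w:20s} {c:30s} {mean:6.2f} ± {std:4.2f} ms")

Holding the workloads fixed while only the hierarchy depth varies is what lets any remaining gap be attributed to the abstraction rather than to low-level codegen.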

Circularity Check

0 steps flagged

No circularity: empirical compiler evaluation with no derivations or fitted predictions

full rationale

The paper introduces a hierarchical Core-Device-Task abstraction for a tensor compiler and reports performance results from kernel benchmarks, vLLM inference, and production deployment. No equations, first-principles derivations, parameter fitting, or predictions appear in the provided text. Claims rest on measured speedups and MFU gains rather than any reduction of outputs to inputs by construction. Self-citations are not load-bearing here, and the central claims do not collapse to tautologies or renamed inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no free parameters, axioms, or invented entities to audit; the focus is on the compiler abstraction and empirical results.

pith-pipeline@v0.9.0 · 5625 in / 1281 out tokens · 61812 ms · 2026-05-10T15:54:37.187903+00:00 · methodology

