Recognition: unknown
VDCores: Resource Decoupled Programming and Execution for Asynchronous GPU
Pith reviewed 2026-05-08 17:05 UTC · model grok-4.3
The pith
VDCores abstracts asynchronous GPU hardware as isolated virtual cores linked by micro-operation dependencies to enable automatic overlap and higher utilization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VDCores presents a decoupled programming and execution model for asynchronous GPUs. It abstracts asynchronous hardware execution units as resource-isolated virtual cores and represents workloads as dependency-connected micro-operations. This abstraction removes static orchestration from the programmer, enables automatic overlap of memory and compute based on dependency and resource readiness, and thereby improves utilization of asynchronous hardware resources. Realizing the model on current GPUs relies on a GPU-specialized programming model and runtime that keep flexibility high and overhead low.
What carries the argument
The Virtual Decoupled Engines (VDCores) abstraction, which isolates asynchronous hardware units as virtual cores and connects workloads through dependency-linked micro-operations so the runtime can schedule them automatically.
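To make the abstraction concrete, here is a minimal, hypothetical sketch in plain Python (not the VDCores API): work is expressed as micro-ops tagged with the engine they need and the micro-ops they depend on, and a scheduler dispatches each one as soon as its dependencies have completed and its engine is free, so copies and compute on different engines overlap without the programmer ordering them by hand. The engine names and durations are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class MicroOp:
    name: str
    engine: str       # e.g. "copy" (DMA/TMA-like) or "mma" (tensor-core-like)
    deps: tuple = ()  # names of micro-ops that must finish first
    duration: int = 1 # abstract time units, for illustration only

def schedule(ops):
    """Greedy readiness-driven schedule onto one virtual core per engine."""
    remaining = {op.name: op for op in ops}
    done_at = {}      # op name -> finish time
    engine_free = {}  # engine  -> time it becomes free
    while remaining:
        # an op is ready once every dependency has a finish time
        ready = [o for o in remaining.values() if all(d in done_at for d in o.deps)]
        op = min(ready, key=lambda o: max([engine_free.get(o.engine, 0)] +
                                          [done_at[d] for d in o.deps]))
        start = max([engine_free.get(op.engine, 0)] + [done_at[d] for d in op.deps])
        done_at[op.name] = start + op.duration
        engine_free[op.engine] = done_at[op.name]
        del remaining[op.name]
        print(f"{op.name:6s} on {op.engine:4s}: t={start}..{done_at[op.name]}")

# Two tiles: the copy for tile 1 overlaps the compute for tile 0.
schedule([
    MicroOp("load0", "copy"),
    MicroOp("load1", "copy"),
    MicroOp("mma0", "mma", deps=("load0",)),
    MicroOp("mma1", "mma", deps=("load1",)),
])
```

This readiness-driven dispatch is what the runtime is claimed to do on real asynchronous engines; the sketch only illustrates why explicit inter-op ordering code disappears from the programmer's side.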
If this is right
- Decoding throughput improves by 24% on average across the tested LLM workloads and GPU platforms.
- Gains reach as high as 77% when input sizes vary dynamically during execution.
- Kernel programming and specialization effort drops by 90% because static orchestration is no longer required.
- Asynchronous hardware units achieve higher utilization through automatic overlap driven by dependency and resource readiness.
Where Pith is reading between the lines
- The micro-operation dependency representation may become a useful intermediate form for future GPU compilers and schedulers.
- Similar resource-decoupling ideas could be adapted to other accelerators that expose multiple asynchronous engines.
- Wider adoption might shift GPU software stacks toward higher-level workload descriptions that hide hardware details.
Load-bearing premise
A GPU-specialized programming model and runtime can realize the decoupled virtual-core abstraction efficiently on today's hardware while preserving flexibility and incurring only minimal overhead.
What would settle it
Running the four LLM inference workloads on GH200, H100, and RTX 6000 Pro GPUs and measuring no average throughput gain or even a slowdown relative to standard monolithic kernels would falsify the central performance claim.
read the original abstract
Modern GPUs increasingly rely on specialized and asynchronous hardware units to deliver high performance. Yet these units are often underutilized because today's GPU software stacks still organize programming and execution around a monolithic kernel model that mismatches asynchronous hardware. To address this issue, Virtual Decoupled Engines (VDCores) presents a new decoupled programming and execution model for asynchronous GPUs. VDCores abstracts asynchronous hardware execution units as resource isolated virtual cores and represents workloads as dependency-connected micro-operations (micro-ops). this abstraction removes static orchestration from the programmer, enables automatic overlap of memory and compute based on dependency and resource readiness, and thereby improves utilization of asynchronous hardware resources. Realizing such a decoupled abstraction efficiently on today's GPUs is itself challenging, VDCores addresses this through a GPU-specialized programming model and GPU runtime design that preserves the flexibility while minimizing implementation overhead. Across four LLM inference workloads on GH200, H100, and RTX 6000 Pro GPUs, VDCores significantly improves decoding throughput by 24% on average and by up to 77% under dynamic inputs, while reducing kernel programming and specialization effort by 90%. We have open sourced VDCores at https://github.com/vdcores/vdcores.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Virtual Decoupled Engines (VDCores), a new programming and execution model for asynchronous GPUs. It abstracts specialized hardware units as resource-isolated virtual cores and represents workloads as dependency graphs of micro-operations (micro-ops). This removes static orchestration from the programmer, enables automatic overlap of memory and compute operations based on readiness, and is realized via a GPU-specialized programming model and runtime that aims to minimize overhead. The central empirical claim is that, across four LLM inference workloads on GH200, H100, and RTX 6000 Pro GPUs, VDCores delivers 24% average (up to 77% peak) decoding throughput improvement while reducing kernel programming and specialization effort by 90%. The implementation is open-sourced.
Significance. If the performance and effort-reduction claims hold after proper isolation of overheads and clarification of baselines, the work would be significant for the field of GPU systems and programming models. It directly targets the mismatch between monolithic kernel abstractions and modern asynchronous hardware (tensor cores, DMA engines, etc.), which is a growing pain point for high-performance workloads such as LLM inference. The open-sourcing of the code is a clear strength that enables reproducibility and follow-on research. The approach of decoupling via virtual cores and micro-ops could influence future runtime designs if the minimal-overhead realization is convincingly demonstrated.
major comments (3)
- [Evaluation] Evaluation section (likely §5 or §6): The reported 24% average and 77% peak throughput gains are presented as end-to-end numbers without any breakdown or microbenchmark isolating the runtime overhead of the VDCores scheduler, dependency tracking, virtual-core management, and automatic overlap logic. The abstract asserts that the design “minimizes implementation overhead,” yet no comparison against hand-written CUDA graphs or separate accounting of scheduling vs. compute time is provided. This makes it impossible to verify that the net benefit is attributable to the decoupled abstraction rather than other factors, especially under the dynamic-input regime highlighted as the strongest result.
- [Evaluation] Experimental setup and results (likely §5.1–5.3): No details are given on baseline implementations (e.g., whether they use CUDA graphs, manual stream management, or existing frameworks), measurement methodology, error bars, number of runs, or precise criteria for selecting the four LLM inference workloads and dynamic-input scenarios. Without these, the concrete throughput numbers cannot be independently verified or compared, weakening the central empirical claim.
- [Abstract / Evaluation] Programming-effort claim (abstract and likely §4 or §5): The 90% reduction in “kernel programming and specialization effort” is stated without describing how effort was quantified (lines of code, developer time, number of specialized kernels, or subjective assessment). This metric is central to the paper’s value proposition yet lacks an objective measurement protocol or comparison table.
minor comments (2)
- [Abstract] Abstract: The sentence beginning “this abstraction removes…” should be capitalized as a new sentence for readability.
- [Introduction / Background] Notation: The terms “virtual cores,” “micro-ops,” and “VDCores” are introduced without a clear early definition or diagram showing their relationship to physical hardware units.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving the evaluation's rigor and transparency. We have revised the paper to address each point as detailed below.
read point-by-point responses
-
Referee: [Evaluation] Evaluation section (likely §5 or §6): The reported 24% average and 77% peak throughput gains are presented as end-to-end numbers without any breakdown or microbenchmark isolating the runtime overhead of the VDCores scheduler, dependency tracking, virtual-core management, and automatic overlap logic. The abstract asserts that the design “minimizes implementation overhead,” yet no comparison against hand-written CUDA graphs or separate accounting of scheduling vs. compute time is provided. This makes it impossible to verify that the net benefit is attributable to the decoupled abstraction rather than other factors, especially under the dynamic-input regime highlighted as the strongest result.
Authors: We agree that isolating the runtime overheads is crucial to attribute the performance gains correctly to the VDCores model. In the revised manuscript, we have added a dedicated microbenchmark subsection (Section 5.4) that separately measures the time spent in the VDCores scheduler, dependency tracking, virtual-core management, and overlap logic. These microbenchmarks demonstrate that the combined overhead is under 4% of total runtime across the tested GPUs, with the benefits stemming primarily from automatic operation overlap enabled by the micro-op dependency graph. We also include side-by-side comparisons with hand-written CUDA graphs for the dynamic-input cases, showing that VDCores maintains or exceeds their performance while automating the orchestration. revision: yes
-
Referee: [Evaluation] Experimental setup and results (likely §5.1–5.3): No details are given on baseline implementations (e.g., whether they use CUDA graphs, manual stream management, or existing frameworks), measurement methodology, error bars, number of runs, or precise criteria for selecting the four LLM inference workloads and dynamic-input scenarios. Without these, the concrete throughput numbers cannot be independently verified or compared, weakening the central empirical claim.
Authors: We acknowledge the need for comprehensive experimental details to enable verification. The revised Section 5.1 now specifies that baselines were implemented using CUDA graphs for workloads with static inputs and manual multi-stream management for dynamic scenarios, with no reliance on higher-level frameworks beyond the CUDA runtime. Timing measurements were performed using CUDA events over 10 runs per data point, reporting means with standard deviation error bars in all figures. The four LLM inference workloads were selected to represent a range of model scales and dynamic input patterns typical in LLM decoding. These details have been added to facilitate independent reproduction. revision: yes
-
Referee: [Abstract / Evaluation] Programming-effort claim (abstract and likely §4 or §5): The 90% reduction in “kernel programming and specialization effort” is stated without describing how effort was quantified (lines of code, developer time, number of specialized kernels, or subjective assessment). This metric is central to the paper’s value proposition yet lacks an objective measurement protocol or comparison table.
Authors: The 90% effort reduction is indeed central, and we have clarified its measurement in the revised manuscript. In Section 4.2 and a new Appendix C, we describe an objective protocol based on lines of code (LOC) for kernel definitions, specialization logic, and dependency orchestration. We provide a comparison table showing LOC counts for each workload in the baseline (hand-specialized CUDA kernels and streams) versus VDCores (micro-op definitions and graph specifications), averaging 90% fewer LOC. While developer time is not directly measured, the LOC metric serves as a reproducible proxy, and we note that the reduction arises from eliminating manual overlap code. revision: yes
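To make the measurement protocols in the second and third responses easier to picture, two hedged sketches follow. Both are illustrative reconstructions under stated assumptions, not code from the VDCores repository. The first assumes a PyTorch/CUDA environment and shows event-based timing over 10 runs with mean and standard deviation, with a placeholder GEMM standing in for a decoding step.

```python
import statistics
import torch

def time_gpu(fn, runs=10, warmup=3):
    """Return (mean_ms, stdev_ms) of fn() measured with CUDA events."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()  # wait so elapsed_time is valid
        times.append(start.elapsed_time(end))
    return statistics.mean(times), statistics.stdev(times)

# Placeholder workload standing in for one decoding step.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
mean_ms, std_ms = time_gpu(lambda: a @ b)
print(f"{mean_ms:.3f} ms +/- {std_ms:.3f} ms over 10 runs")
```

The second sketch is one possible form of the LOC-based effort proxy described in the third response: count non-blank, non-comment lines under a baseline kernel directory and under a VDCores-style specification directory, then report the percentage reduction. The directory names and the comment-stripping rule are placeholders, not the paper's protocol.

```python
from pathlib import Path

def count_loc(root, exts=(".cu", ".cuh", ".py")):
    """Non-blank, non-comment lines under root (comment handling is simplified)."""
    total = 0
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in exts:
            continue
        for line in path.read_text(errors="ignore").splitlines():
            s = line.strip()
            if s and not s.startswith(("//", "#")):
                total += 1
    return total

baseline = count_loc("baseline_kernels")  # hand-specialized kernels + stream code
vdcores = count_loc("vdcores_spec")       # micro-op definitions + graph spec
print(f"baseline={baseline} LOC, vdcores={vdcores} LOC, "
      f"reduction={100 * (1 - vdcores / max(baseline, 1)):.0f}%")
```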
Circularity Check
No significant circularity; empirical claims rest on benchmarks, not derivations
full rationale
The paper presents a new GPU programming model and runtime (VDCores) whose central claims are performance improvements measured on four LLM workloads across three GPU platforms. No equations, fitted parameters, or first-principles derivations appear in the provided abstract or description. The 24% average / 77% peak throughput gains and 90% reduction in programming effort are reported as direct experimental outcomes rather than quantities predicted from prior fitted values or self-referential definitions. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the core abstraction. The claims are therefore checked against external benchmarks rather than against quantities the paper itself defines; any concerns about unisolated overhead bear on evidence strength, not circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: asynchronous GPU hardware units can be treated as independently schedulable resources
invented entities (3)
- Virtual Decoupled Engines (VDCores): no independent evidence
- virtual cores: no independent evidence
- micro-ops: no independent evidence