pith. machine review for the scientific record.

arxiv: 1805.00907 · v3 · submitted 2018-05-02 · 💻 cs.PL

Recognition: unknown

Glow: Graph Lowering Compiler Techniques for Neural Networks

Authors on Pith: no claims yet
classification: 💻 cs.PL
keywords: compiler, glow, hardware, intermediate, lowering, number, representation, targets
Original abstract

This paper presents the design of Glow, a machine learning compiler for heterogeneous hardware. It is a pragmatic approach to compilation that enables the generation of highly optimized code for multiple targets. Glow lowers the traditional neural network dataflow graph into a two-phase strongly-typed intermediate representation. The high-level intermediate representation allows the optimizer to perform domain-specific optimizations. The lower-level instruction-based address-only intermediate representation allows the compiler to perform memory-related optimizations, such as instruction scheduling, static memory allocation and copy elimination. At the lowest level, the optimizer performs machine-specific code generation to take advantage of specialized hardware features. Glow features a lowering phase which enables the compiler to support a high number of input operators as well as a large number of hardware targets by eliminating the need to implement all operators on all targets. The lowering phase is designed to reduce the input space and allow new hardware backends to focus on a small number of linear algebra primitives.
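The lowering phase described in the abstract can be sketched schematically: a high-level operator is rewritten into a small set of linear algebra primitives, so a new backend only has to implement the primitives rather than every input operator. This is an illustrative sketch, not Glow's actual C++ API; all class, operator, and function names here are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical primitive set a minimal backend would implement.
PRIMITIVES = {"MatMul", "BroadcastAdd", "Max", "Splat"}

@dataclass
class Node:
    """A node in a toy dataflow graph (illustrative, not Glow's IR)."""
    op: str
    inputs: List["Node"] = field(default_factory=list)

def lower(node: Node) -> Node:
    """Recursively rewrite high-level operators into primitives."""
    node.inputs = [lower(i) for i in node.inputs]
    if node.op == "FullyConnected":
        # FC(x, W, b)  ->  BroadcastAdd(MatMul(x, W), b)
        x, w, b = node.inputs
        return Node("BroadcastAdd", [Node("MatMul", [x, w]), b])
    if node.op == "ReLU":
        # ReLU(x)  ->  Max(x, Splat(0))
        (x,) = node.inputs
        return Node("Max", [x, Node("Splat")])
    return node  # already a primitive (or a graph input)

def ops(node: Node) -> Set[str]:
    """Collect every operator reachable from `node`."""
    s = {node.op}
    for i in node.inputs:
        s |= ops(i)
    return s
```

After lowering, a backend only sees primitives: lowering `ReLU(FullyConnected(x, W, b))` yields a graph whose operators are a subset of `PRIMITIVES` plus graph inputs, which is how the reduced input space lets new hardware targets focus on a few linear algebra kernels.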

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlowCompile: An Optimizing Compiler for Structured LLM Workflows

    cs.CL 2026-05 unverdicted novelty 8.0

    FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

  2. Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels

    cs.PL 2026-04 unverdicted novelty 7.0

    Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.

  3. Forge-UGC: FX optimization and register-graph engine for universal graph compiler

    cs.AR 2026-04 unverdicted novelty 5.0

    Forge-UGC delivers a hardware-agnostic four-phase compiler for transformers that reduces compilation time by 6.9-9.2x, inference latency by 18-36%, and energy use by 30-41% on NPU hardware compared with existing frameworks.

  4. From LLM to Silicon: RL-Driven ASIC Architecture Exploration for On-Device AI Inference

    cs.AR 2026-04 unverdicted novelty 5.0

    An RL agent using Soft Actor-Critic with Mixture-of-Experts jointly optimizes ASIC architecture, memory hierarchy, and partitioning for AI inference, achieving 29809 tokens/s for Llama 3.1 at 3nm and under 13mW for Sm...