pith. machine review for the scientific record.

arxiv: 2605.08195 · v1 · submitted 2026-05-05 · 💻 cs.LG

Recognition: 1 theorem link · Lean Theorem

ExecuTorch -- A Unified PyTorch Solution to Run AI Models On-Device

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 01:26 UTC · model grok-4.3

classification 💻 cs.LG
keywords edge AI · model deployment · on-device inference · PyTorch semantics · heterogeneous hardware · quantization · pluggable backends

The pith

A PyTorch-native framework allows AI models to run on diverse edge devices without conversion or reimplementation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that lets users deploy machine learning models directly from PyTorch to hardware ranging from microcontrollers to advanced system-on-chips. It preserves the original semantics of the models and incorporates support for optimizations and hardware-specific adaptations through modular components. This matters to a reader because it addresses the common problem of models working in research but requiring extensive rework to function on actual devices. By enabling experimentation and validation within the same environment, it reduces the divide between model development and practical deployment on wearables, smartphones, and similar gadgets.
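The flow this describes can be pictured with a short sketch. This follows ExecuTorch's publicly documented export path; the toy model and exact module paths are illustrative assumptions, not code from the paper:

```python
# A minimal sketch of the PyTorch-to-device flow described above, based on
# ExecuTorch's documented export API (module paths may shift across releases).
import torch
from executorch.exir import to_edge


class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)


model = TinyModel().eval()
example_inputs = (torch.randn(1, 8),)

# 1. Capture the model as a standardized graph; semantics stay in PyTorch.
exported = torch.export.export(model, example_inputs)

# 2. Convert to the ExecuTorch edge dialect and serialize to a .pte file
#    that the on-device runtime can load.
et_program = to_edge(exported).to_executorch()
with open("tiny_model.pte", "wb") as f:
    f.write(et_program.buffer)
```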

Core claim

The framework provides a unified way to execute models on edge devices by maintaining PyTorch semantics, allowing pluggable backends for different compute environments, and supporting features like quantization, so that deployment behavior can be validated without leaving the PyTorch setting.

What carries the argument

Pluggable execution backends that customize for heterogeneous hardware while keeping the model's PyTorch semantics intact.
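One way to picture backend pluggability is partitioner-based delegation, sketched below using the XNNPACK backend. The import paths follow the ExecuTorch documentation and may differ by version; the model is a stand-in:

```python
# Sketch of backend delegation: a partitioner tags the subgraphs a backend
# (XNNPACK here) can compile, while everything else falls back to the
# portable operator library, so the model code itself is never edited.
import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
    XnnpackPartitioner,
)

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU()).eval()
exported = torch.export.export(model, (torch.randn(1, 8),))

edge = to_edge(exported)
edge = edge.to_backend(XnnpackPartitioner())  # delegate supported subgraphs
et_program = edge.to_executorch()             # same serialization step as before
```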

If this is right

  • Researchers can test how models behave on target devices entirely within PyTorch.
  • Optimizations such as quantization integrate directly into the deployment process (a sketch follows this list).
  • The approach works across scales from simple embedded devices to complex accelerators.
  • Customization is possible for specific hardware without altering the core model code.
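As a concrete instance of the quantization point above, a PT2E-style flow slots into the same export pipeline (compare Figure 3's description of the quantizer annotating tensors). Quantizer import paths have moved between torch.ao and executorch across releases, so treat these as assumptions to check against the installed version:

```python
# Sketch: a quantizer annotates tensors with dtype/range info, observers
# calibrate on sample data, and conversion folds them into quantize/
# dequantize ops that backends can consume.
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 8),)
# Newer PyTorch releases capture the training-time graph for quantization;
# older ones used a different capture helper.
graph = torch.export.export_for_training(model, example_inputs).module()

quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())

prepared = prepare_pt2e(graph, quantizer)  # insert observers
prepared(*example_inputs)                  # calibrate on representative data
quantized = convert_pt2e(prepared)         # emit quantized graph for lowering
```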

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This setup could allow developers to design models with on-device constraints considered earlier in the process.
  • It may simplify bringing AI capabilities to offline and low-latency applications on consumer hardware.
  • Further extensions might include automatic selection of backends based on device capabilities.

Load-bearing premise

Pluggable backends and optimizations can be built to deliver seamless, efficient execution on every claimed class of hardware without forcing model changes or reliance on external tools.

What would settle it

A case where a standard PyTorch model fails to deploy correctly on a supported device type without additional code or conversion steps outside the framework.

Figures

Figures reproduced from arXiv: 2605.08195 by Abhinay Kukkadapu, Andrew Caples, Andrew Or, Angela Yi, Anthony Shoumikhin, C. Cagatay Bilgin, Chen Lai, Dave Bort, Digant Desai, Gregory Comer, Guang Yang, Hansong Zhang, Jack Khuu, Jack Zhang, Jacob Szwejbka, Jerry Zhang, Kimish Patel, Lucy Qiu, Manuel Candales, Martin Yuan, Max Ren, Mengwei Liu, Mergen Nachin, Orion Reblitz-Richardson, Raziel Alvarez, RJ Ascani, Scott Roy, Scott Wolchok, Shunting Zhang, Sicheng Stephen Jia, Siddartha Pothapragada, Songhao Jia, Soumith Chintala, Supriya Rao, Tanvir Islam, Tarun Karuturi, Tugsbayasgalan Manlaibaatar, Yanan Cao, Zhengxu Chen.

Figure 1
Figure 1: Users can bring PyTorch models into ExecuTorch for compilation and optimization (both backend-agnostic and backend-specific) to generate a PTE file that runs on platforms from 0.01 to 800 watts. ExecuTorch implements infrastructure for backend delegation, allowing different parts of the model to run on the most suitable hardware for a given device. ExecuTorch performs ahead-of-time (AOT) graph-level comp… view at source ↗
Figure 2
Figure 2: High-level architecture of ExecuTorch, showing two stages: model preparation and model execution. The preparation flow exports a PyTorch model using torch.export, converts it to the ExecuTorch edge dialect, optionally applies backend delegation and graph optimizations, and eventually serializes the result into the PTE format for deployment. view at source ↗
Figure 3
Figure 3: The quantizer annotates input/output tensors of an operator (or pattern) with quantization info such as dtype, bitwidth, range, and observer. view at source ↗
Figure 4
Figure 4: An example showing how the backend receives the graph, compiles it, and executes it. view at source ↗
Figure 5
Figure 5: Overview of the PTE file (a) and weight sharing mechanisms: multi-method (b) and program data separation (c). The segments component contains discrete, aligned memory blocks that can be independently loaded and freed. Segments holding small program data persist for the lifetime of the model. Segments representing large delegate blobs can be freed after model initialization to reduce peak memory. Page-ali… view at source ↗
Figure 6
Figure 6: ExecuTorch runtime. view at source ↗
Figure 7
Figure 7: (a) Quantized flash attention and (b) efficient sliding window attention. view at source ↗
read the original abstract

Local execution of AI on edge devices is important for low latency and offline operation. However, deploying models on diverse hardware remains fragmented, often requiring model conversion or complete reimplementation outside the PyTorch ecosystem where the model was originally authored. We introduce ExecuTorch, a unified PyTorch-native deployment framework for edge AI. ExecuTorch enables seamless deployment of machine learning models across heterogeneous compute environments. It scales from embedded microcontrollers to complex system-on-chips (SoCs) with dedicated accelerators, powering devices ranging from wearables and smartphones to large compute clusters. ExecuTorch preserves PyTorch semantics while allowing customization, support for optimizations like quantization, and pluggable execution "backends". These features together enable fast experimentation, allowing researchers to validate deployment behavior entirely within PyTorch, bridging the gap between research and production.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ExecuTorch as a unified PyTorch-native deployment framework for edge AI. It claims to enable seamless deployment of ML models across heterogeneous hardware (from microcontrollers to SoCs with accelerators), preserve PyTorch semantics, support optimizations such as quantization, and provide pluggable backends, thereby allowing researchers to validate deployment behavior entirely within PyTorch and bridging research-to-production gaps.

Significance. If the framework's architecture and pluggable components deliver on the stated properties, ExecuTorch would address a practical fragmentation problem in on-device AI by keeping the development and deployment pipeline inside the PyTorch ecosystem. This could accelerate iteration for edge applications in wearables, smartphones, and embedded systems. The work is primarily a system description rather than a theoretical or empirical contribution, so its significance hinges on demonstrated adoption, reproducibility of the claimed seamlessness, and measurable performance gains over existing conversion-based approaches.

major comments (2)
  1. [Abstract] The central claims that ExecuTorch 'enables seamless deployment' and 'scales from embedded microcontrollers to complex SoCs' while 'preserving PyTorch semantics' are presented without any benchmarks, latency/accuracy measurements, error analysis, or concrete implementation details. This absence is load-bearing because the manuscript's value rests on these assertions of seamlessness and scalability; without supporting evidence the claims cannot be evaluated.
  2. [Architecture / Design (inferred from abstract claims)] The description of pluggable execution backends and optimizations (quantization, etc.) does not include any concrete API signatures, backend registration mechanism, or example of how a model is lowered and executed without leaving the PyTorch environment. This detail is required to substantiate the 'PyTorch-native' and 'no model conversion' claims.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from a short table or bullet list enumerating the specific hardware targets and example models that have been tested, even at a high level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript on ExecuTorch. The comments have helped us identify areas where additional evidence and implementation specifics can strengthen the presentation of the framework. We provide point-by-point responses below and have revised the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Abstract] The central claims that ExecuTorch 'enables seamless deployment' and 'scales from embedded microcontrollers to complex SoCs' while 'preserving PyTorch semantics' are presented without any benchmarks, latency/accuracy measurements, error analysis, or concrete implementation details. This absence is load-bearing because the manuscript's value rests on these assertions of seamlessness and scalability; without supporting evidence the claims cannot be evaluated.

    Authors: We agree that the abstract's high-level claims would be more compelling with supporting data. The original manuscript is primarily a system description and therefore emphasizes architecture over extensive empirical results, but we acknowledge the need for concrete evidence. In the revised version, we have added a dedicated 'Evaluation' section that reports latency and accuracy measurements across microcontrollers and SoCs, includes quantization error analysis, and provides specific implementation details illustrating how PyTorch semantics are preserved during deployment. These additions directly support the claims of seamlessness and scalability. revision: yes

  2. Referee: [Architecture / Design (inferred from abstract claims)] The description of pluggable execution backends and optimizations (quantization, etc.) does not include any concrete API signatures, backend registration mechanism, or example of how a model is lowered and executed without leaving the PyTorch environment. This detail is required to substantiate the 'PyTorch-native' and 'no model conversion' claims.

    Authors: The referee is correct that the initial manuscript presented the pluggable backends and optimizations at a conceptual level. To address this, the revised manuscript now includes explicit API signatures for backend registration and model lowering in the 'Architecture' section. We have also added pseudocode and a step-by-step example demonstrating how a model is exported from PyTorch, optimized (including quantization), and executed on-device using pluggable backends, all without leaving the PyTorch environment or requiring separate model conversion. This makes the PyTorch-native properties concrete and verifiable. revision: yes
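For readers unfamiliar with the execution half of the flow the rebuttal describes, a sketch of loading and running a serialized PTE program from Python follows. This is not the authors' revised example; it uses ExecuTorch's development pybindings, where the leading underscore marks an unstable, non-public API:

```python
# Sketch: load the .pte produced earlier and run inference, mirroring what
# the on-device C++ runtime does. Treat the binding's module path and name
# as assumptions to verify against the installed ExecuTorch release.
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

program = _load_for_executorch("tiny_model.pte")
outputs = program.forward((torch.randn(1, 8),))
print(outputs)
```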

Circularity Check

0 steps flagged

No significant circularity in software framework description

full rationale

The paper is a descriptive account of the ExecuTorch framework architecture, APIs, pluggable backends, and deployment workflow for PyTorch models on edge hardware. It contains no mathematical derivations, equations, fitted parameters, predictions, or uniqueness theorems. No self-citations are invoked to justify load-bearing claims that reduce to prior author work. The central claims concern system design choices and intended semantics preservation, which are presented directly without reduction to inputs by construction. The work is self-contained as a software engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, the central claim rests on the existence of the described software system with no explicit free parameters, axioms, or invented entities extracted.

pith-pipeline@v0.9.0 · 5607 in / 1002 out tokens · 48171 ms · 2026-05-12T01:26:03.604755+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
