pith. machine review for the scientific record.

arxiv: 2602.15166 · v2 · submitted 2026-02-16 · 💻 cs.AR


Fast and Fusiest: An Optimal Fusion-Aware Mapper for Accelerator Design


Pith reviewed 2026-05-15 21:29 UTC · model grok-4.3

classification 💻 cs.AR
keywords: tensor accelerator mapping, fusion optimization, energy-delay product, automated mapper, pruning, tensor algebra, data movement scheduling

The pith

FFM prunes partial mappings that can never appear in an optimal fused schedule, making it feasible to search the full fused mapspace for tensor accelerators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Fast and Fusiest Mapper (FFM) as a way to automatically discover the best way to schedule data movement and operations on tensor algebra accelerators while keeping intermediate results on-chip through fusion. Prior automated mappers could not explore the full space of fused mappings because the number of candidates grows exponentially with the number of computation steps, forcing designers to rely on hand-tuned fusion choices. FFM eliminates large subsets of mappings shown never to appear in any optimal schedule, then assembles the remaining partial mappings into complete fused schedules. The resulting mappings deliver up to 1.8 times lower energy-delay product than the hand-optimized TransFusion accelerator and run more than 10,000 times faster than TileFlow or SET.
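The headline metric, energy-delay product (EDP), is simply energy multiplied by delay, so it rewards designs that are both efficient and fast. A minimal illustration of what a 1.8× EDP reduction means, using made-up numbers that are not taken from the paper:

```python
# EDP = energy * delay. All numbers below are illustrative placeholders,
# not data from the paper.
baseline = {"energy_j": 0.90, "delay_s": 0.0200}   # e.g., a hand-tuned fusion choice
candidate = {"energy_j": 0.72, "delay_s": 0.0139}  # e.g., a mapper-found fused mapping

def edp(d):
    return d["energy_j"] * d["delay_s"]

reduction = edp(baseline) / edp(candidate)
print(f"EDP reduction: {reduction:.2f}x")  # about 1.8x for these placeholder numbers
```

Note that a mapping can lower EDP by cutting energy, latency, or both; fusion typically helps both at once by avoiding off-chip traffic.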

Core claim

FFM shrinks the search space by pruning partial mappings proven never to be part of any optimal mapping, which at once eliminates every suboptimal mapping containing them, then joins the surviving partial mappings to construct optimal fused mappings.
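The abstract does not spell out the pruning rules, but the prune-then-join structure can be sketched. The toy below is not the paper's algorithm: it assumes an additive cost model and a shared on-chip buffer budget, under which a partial mapping is safely prunable whenever another partial for the same step is no worse in both cost and buffer need:

```python
from itertools import product

def prune_dominated(partials):
    """Drop any (cost, buffer_need) partial that another partial for the
    same step dominates: no worse on both axes, strictly better on one.
    Under an additive cost model with a shared buffer budget, swapping a
    dominated partial for its dominator never hurts any complete mapping,
    so no optimal mapping is lost."""
    return [p for p in partials
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in partials)]

def join(steps, buffer_capacity):
    """Enumerate one surviving partial per step; keep combinations whose
    total buffer need fits on chip; return the minimum-cost mapping."""
    best = None
    for combo in product(*steps):
        if sum(b for _, b in combo) > buffer_capacity:
            continue
        cost = sum(c for c, _ in combo)
        if best is None or cost < best[0]:
            best = (cost, combo)
    return best

# Two computation steps, each with candidate partial mappings (cost, buffer).
step1 = prune_dominated([(10, 4), (12, 4), (8, 6)])   # (12, 4) is pruned
step2 = prune_dominated([(5, 2), (7, 1)])             # neither dominates
best = join([step1, step2], buffer_capacity=7)
print(best)  # (15, ((10, 4), (5, 2)))
```

The payoff is the same as in the paper's claim: pruning (12, 4) removes every complete mapping containing it in one step, before the join ever enumerates them.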

What carries the argument

FFM's pruning rules on partial mappings, which remove any partial schedule proven never to participate in a globally optimal fused mapping before the surviving pieces are joined into complete schedules.

Load-bearing premise

The pruning rules correctly eliminate only those partial mappings that can never belong to an optimal fused schedule for the given workload shapes and fusion semantics.

What would settle it

A workload in which an optimal fused mapping is known to contain a partial mapping that FFM's rules prune would falsify the pruning correctness claim.
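Under a toy model of this kind (assumed additive costs and a shared buffer budget, not the paper's cost model), that falsification criterion can be checked mechanically: the optimum over the pruned space must equal the optimum over the full space on every instance, and a single mismatch would refute the pruning rule:

```python
import random
from itertools import product

def optimum(steps, capacity):
    """Best total cost over one (cost, buffer_need) partial per step,
    subject to combined buffer needs fitting in `capacity`; None if
    no combination is feasible."""
    best = None
    for combo in product(*steps):
        if sum(b for _, b in combo) <= capacity:
            cost = sum(c for c, _ in combo)
            best = cost if best is None else min(best, cost)
    return best

def prune_dominated(partials):
    # Prune a partial if another partial is no worse on both cost and
    # buffer need (and strictly better on at least one).
    return [p for p in partials
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in partials)]

random.seed(0)
for _ in range(200):
    steps = [[(random.randint(1, 20), random.randint(1, 5)) for _ in range(6)]
             for _ in range(3)]
    pruned = [prune_dominated(s) for s in steps]
    # Any instance where these differ would falsify this pruning rule.
    assert optimum(pruned, capacity=9) == optimum(steps, capacity=9)
print("no counterexample found in 200 random instances")
```

Random testing of this sort can only fail to find a counterexample; the referee's request below for a proof sketch is the stronger form of the same check.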

Figures

Figures reproduced from arXiv: 2602.15166 by Joel S. Emer, Michael Gilbert, Tanner Andrulis, Vivienne Sze.

Figure 1: An analogy of pmappings as puzzle pieces. (a) Tabs
Figure 2: (a) Pseudocode of a mapping for two matrix-vector
Figure 4: An example constrained optimization and (b) ex
Figure 5: (a) ReservationTree for the mapping in Fig.
Figure 7: Overview of the Fast and Fusiest Mapper (FFM).
Figure 8: Convergence speeds of baselines relative to FFM.
Figure 9: FFM is fast and per-Einsum runtime remains flat
Figure 12: (a) FFM reduces off-chip memory energy compared
Figure 13: FFM strategically fuses Einsums with low com
Figure 11: FFM reduces EDP, latency, and energy per token
read the original abstract

A low-latency and energy-efficient tensor algebra accelerator design must optimize how data movement and operations are scheduled (i.e., mapped) in the accelerator architecture. A key mapping optimization is fusion, meaning holding data on-chip between computation steps in the workload, which has been shown to reduce energy and latency by reducing expensive off-chip data movement. However, the optimal fusion choice depends on the workload and workload shape, and a mapper, which searches for the optimal mapping, can improve energy and latency significantly. However, prior mappers cannot find optimal mappings with fusion (i.e., fused mappings) in a feasible runtime because the number of fused mappings to search increases exponentially with the number of computation steps in the workload. In this paper, we introduce the Fast and Fusiest Mapper (FFM), a mapper to quickly find optimal mappings in a comprehensive fused mapspace for tensor algebra workloads. FFM shrinks the search space by pruning subsets of mappings (i.e., partial mappings) that are shown to never be a part of optimal mappings, quickly eliminating all suboptimal mappings containing those partial mappings. Then FFM joins partial mappings to construct optimal fused mappings. Using FFM, we demonstrate an energy-delay-product (EDP) reduction by up to $1.8\times$ compared to TransFusion, a state-of-the-art accelerator with hand-optimized fusion. Moreover, we show that FFM finds mappings orders of magnitude faster ($>10,000\times$) than prior automated mappers TileFlow and SET, and given the same runtime, reduces EDP by $>2\times$.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Fast and Fusiest Mapper (FFM), which prunes partial mappings shown to never appear in optimal fused mappings for tensor algebra workloads on accelerators, then joins the surviving partial mappings to construct complete optimal fused mappings. It claims this enables exhaustive search in a comprehensive fused mapspace, yielding up to 1.8× EDP reduction versus the hand-optimized TransFusion accelerator and >10,000× faster mapping than TileFlow and SET (with >2× EDP improvement at equal runtime).

Significance. If the pruning predicate is sound, FFM would make optimal fusion-aware mapping tractable for workloads with many computation steps, directly addressing the exponential mapspace growth that limits prior automated mappers. This could improve automated design-space exploration for energy-efficient tensor accelerators beyond current hand-tuned or heuristic approaches.

major comments (3)
  1. [§3] §3 (Pruning Rules): The central claim that the pruning rules eliminate only mappings that 'can never be a part of optimal mappings' is presented without a formal invariant, proof sketch, or exhaustive enumeration on representative workloads; the rules depend on unstated assumptions about fusion semantics, dataflow reuse, and cost-model monotonicity that must hold for arbitrary tensor-algebra patterns.
  2. [§5] §5 (Experimental Results): Concrete claims of 1.8× EDP reduction and >10,000× speedup are reported, yet the manuscript supplies no details on workload selection, accelerator parameters, cost-model implementation, or any validation that the pruned mapspace still contains the true optimum; this leaves the performance numbers unsupported and non-reproducible from the given text.
  3. [§4.2] §4.2 (Joining Step): The joining procedure that reconstructs complete fused mappings from pruned partials is described at a high level but does not specify how it guarantees completeness (i.e., that every optimal fused mapping is recovered) once pruning has occurred.
minor comments (2)
  1. [Abstract] Abstract: The title uses 'Fusiest' but the term is never defined or motivated in the body; a brief clarification of the intended meaning would improve readability.
  2. [Figure 2] Figure 2: The diagram of the pruning and joining process lacks labels on the partial-mapping nodes and does not indicate which pruning rules are applied at each step.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We address each major comment below and will revise the manuscript to strengthen the formal arguments, add missing experimental details, and clarify the joining procedure.

read point-by-point responses
  1. Referee: [§3] §3 (Pruning Rules): The central claim that the pruning rules eliminate only mappings that 'can never be a part of optimal mappings' is presented without a formal invariant, proof sketch, or exhaustive enumeration on representative workloads; the rules depend on unstated assumptions about fusion semantics, dataflow reuse, and cost-model monotonicity that must hold for arbitrary tensor-algebra patterns.

    Authors: We agree that the current presentation lacks a formal invariant and proof sketch. In the revised manuscript we will add to §3 a proof sketch based on cost-model monotonicity (any increase in partial cost cannot be offset by later fusion choices) together with the standard fusion semantics that fused steps must share on-chip buffers without intermediate DRAM writes. We will also report exhaustive enumeration results on representative small workloads (chains of 3–5 GEMMs and CONVs) confirming that no optimal mapping is eliminated. The assumptions on fusion semantics, dataflow reuse, and monotonicity will be stated explicitly at the start of the section. revision: yes

  2. Referee: [§5] §5 (Experimental Results): Concrete claims of 1.8× EDP reduction and >10,000× speedup are reported, yet the manuscript supplies no details on workload selection, accelerator parameters, cost-model implementation, or any validation that the pruned mapspace still contains the true optimum; this leaves the performance numbers unsupported and non-reproducible from the given text.

    Authors: We accept that the experimental section is insufficiently detailed for reproducibility. The revised §5 will enumerate the exact workloads (MLPerf-derived GEMM/CONV shapes plus synthetic chains), accelerator parameters (PE array size, buffer capacities, NoC bandwidth), and cost-model implementation (Timeloop/Accelergy energy and latency tables). We will also add a validation subsection that compares FFM against exhaustive search on reduced mapspaces to confirm the pruned space retains the true optimum, thereby supporting the reported 1.8× EDP and >10,000× speedup figures. revision: yes

  3. Referee: [§4.2] §4.2 (Joining Step): The joining procedure that reconstructs complete fused mappings from pruned partials is described at a high level but does not specify how it guarantees completeness (i.e., that every optimal fused mapping is recovered) once pruning has occurred.

    Authors: We will expand §4.2 with pseudocode for the join operation and an explicit completeness argument: because the pruning rules of §3 remove only partial mappings that cannot participate in any optimal solution, every optimal complete mapping is composed exclusively of surviving partials; the join enumerates all compatible combinations of those partials and therefore recovers every optimal fused mapping. A short inductive argument will be included to show that no optimal mapping is lost. revision: yes
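The completeness argument in response 3 can be made concrete with a sketch of the join step. Everything here is hypothetical (the compatibility predicate, the cost model, the tile names): consecutive steps are assumed compatible only when they tile the shared intermediate tensor identically, so it can stay on-chip:

```python
def join(steps, compatible):
    """steps: per computation step, a list of (tile, cost) surviving partials.
    compatible(p, q): whether partial q for step i+1 can follow p for step i.
    Enumerates every compatible combination of survivors, so if pruning
    removed only partials that appear in no optimal mapping, the optimal
    complete mapping is guaranteed to be among those enumerated."""
    best = None

    def extend(i, prev, cost, chosen):
        nonlocal best
        if i == len(steps):
            if best is None or cost < best[0]:
                best = (cost, chosen)
            return
        for p in steps[i]:
            if prev is None or compatible(prev, p):
                extend(i + 1, p, cost + p[1], chosen + [p[0]])

    extend(0, None, 0, [])
    return best

# Hypothetical example: fusing two Einsums requires agreeing on the tile
# size of the intermediate tensor they share.
steps = [[("tile8", 3), ("tile4", 2)], [("tile8", 1), ("tile4", 5)]]
same_tiling = lambda p, q: p[0] == q[0]
print(join(steps, same_tiling))  # (4, ['tile8', 'tile8'])
```

Note the structure of the argument: soundness belongs to the pruning rules, while the join only has to be exhaustive over survivors; the inductive step the authors promise would tie the two together.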

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's core mechanism prunes partial mappings claimed to never appear in optimal fused mappings before joining survivors to form complete mappings. This is presented as an analysis of optimality invariants for tensor algebra operations and the accelerator cost model, with results validated empirically against external baselines (TransFusion, TileFlow, SET). No equations or steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the pruning predicate is asserted from workload properties rather than derived from the target result itself. The derivation remains self-contained and externally falsifiable via the reported EDP and runtime comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the pruning correctness is treated as an unstated domain assumption whose validity cannot be audited from the given text.

pith-pipeline@v0.9.0 · 5593 in / 1019 out tokens · 23061 ms · 2026-05-15T21:29:24.399590+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 3 internal anchors

  1. [1] Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-layer CNN accelerators. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–12. https://doi.org/10.1109/MICRO.2016.7783725
  2. [2] Tanner Andrulis. [n. d.]. HWComponents. https://github.com/Accelergy-Project/hwcomponents
  3. [3] Tanner Andrulis, Joel S. Emer, and Vivienne Sze. 2023. RAELLA: Reforming the Arithmetic for Efficient, Low-Resolution, and Low-Loss Analog PIM: No Retraining Required!. In Proceedings of the 50th Annual International Symposium on Computer Architecture (Orlando, FL, USA) (ISCA '23). Association for Computing Machinery, New York, NY, USA, Article 27, 16 pa...
  4. [4] Tanner Andrulis, Joel S. Emer, and Vivienne Sze. 2024. CiMLoop: A Flexible, Accurate, and Fast Compute-In-Memory Modeling Tool. In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10–23. https://doi.org/10.1109/ISPASS61541.2024.00012
  5. [5] Rajeev Balasubramonian, Andrew B. Kahng, Naveen Muralimanohar, Ali Shafiee, and Vaishnav Srinivas. 2017. CACTI 7: New Tools for Interconnect Exploration in Innovative Off-Chip Memories. ACM Trans. Archit. Code Optim. 14, 2, Article 14 (June 2017), 25 pages. https://doi.org/10.1145/3085572
  6. [6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott ...
  7. [7] Jingwei Cai, Yuchen Wei, Zuotong Wu, Sen Peng, and Kaisheng Ma. 2023. Inter-Layer Scheduling Space Definition and Exploration for Tiled Accelerators. In Proceedings of the 50th Annual International Symposium on Computer Architecture (Orlando, FL, USA) (ISCA '23). Association for Computing Machinery, New York, NY, USA, Article 13, 17 pages. https://doi.org...
  8. [8] Xuyi Cai, Ying Wang, and Lei Zhang. 2022. Optimus: An Operator Fusion Framework for Deep Neural Networks. ACM Trans. Embed. Comput. Syst. 22, 1, Article 1 (Oct 2022), 26 pages. https://doi.org/10.1145/3520142
  9. [9] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsba...
  10. [10] https://www.usenix.org/conference/osdi18/presentation/chen
  11. [11] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 367–379. https://doi.org/10.1109/ISCA.2016.40
  12. [12] Tri Dao. 2024. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=mZn2Xyh9Ec
  13. [13] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv e-prints, Article arXiv:2205.14135 (May 2022). https://doi.org/10.48550/arXiv.2205.14135 arXiv:2205.14135 [cs.LG]
  14. [14] A. Einstein. 1916. The Foundation of the General Theory of Relativity. Annalen der Physik 354, 7 (1916), 769–822. https://doi.org/10.1002/andp.19163540702
  15. [15] Michael Gilbert. 2023. LoopTree: Enabling Systematic and Flexible Exploration of Fused-layer Dataflow Accelerators. PhD thesis. Massachusetts Institute of Technology, Cambridge, MA
  16. [16] Michael Gilbert, Tanner Andrulis, Vivienne Sze, and Joel S. Emer. 2026. The Turbo-Charged Mapper: Fast and Optimal Mapping for Accelerator Modeling and Evaluation. arXiv:2602.15172 [cs.AR] https://arxiv.org/abs/2602.15172
  17. [17] Michael Gilbert, Yannan Nellie Wu, Joel S. Emer, and Vivienne Sze. 2024. LoopTree: Exploring the Fused-Layer Dataflow Accelerator Design Space. IEEE Transactions on Circuits and Systems for Artificial Intelligence 1, 1 (2024), 97–
  18. [18] https://doi.org/10.1109/TCASAI.2024.3461716
  19. [19] Koen Goetschalckx, Fengfeng Wu, and Marian Verhelst. 2023. DepFiN: A 12-nm Depth-First, High-Resolution CNN Processor for IO-Efficient Inference. IEEE Journal of Solid-State Circuits 58, 5 (2023), 1425–1435. https://doi.org/10.1109/JSSC.2022.3210591
  20. [20] Google. 2022. XLA: Optimizing Compiler for Machine Learning. https://www.tensorflow.org/xla
  21. [21] Kartik Hegde, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, and Christopher W. Fletcher. 2019. ExTensor: An Accelerator for Sparse Tensor Algebra. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (Columbus, OH, USA) (MICRO '52). Association for Computing Machinery, New ...
  22. [22] Kartik Hegde, Po-An Tsai, Sitao Huang, Vikas Chandra, Angshuman Parashar, and Christopher W. Fletcher. 2021. Mind Mappings: Enabling Efficient Algorithm-Accelerator Mapping Space Search. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '21). IEEE
  23. [23] Mark Horeni, Pooria Taheri, Po-An Tsai, Angshuman Parashar, Joel Emer, and Siddharth Joshi. 2022. Ruby: Improving Hardware Efficiency for Tensor Algebra Accelerators Through Imperfect Factorization. In 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 254–266. https://doi.org/10.1109/ISPASS55109.2022.00039
  24. [24] Qijing Huang, Minwoo Kang, Grace Dinh, Thomas Norell, Aravind Kalaiah, James Demmel, John Wawrzynek, and Yakun Sophia Shao. 2021. CoSA: Scheduling by Constrained Optimization for Spatial Accelerators. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). 554–566. https://doi.org/10.1109/ISCA52012.2021.00050
  25. [25] Hongyang Jia, Hossein Valavi, Yinqi Tang, Jintao Zhang, and Naveen Verma
  26. [26] A Programmable Heterogeneous Microprocessor Based on Bit-Scalable In-Memory Computing. IEEE Journal of Solid-State Circuits 55, 9 (2020), 2609–2621. https://doi.org/10.1109/JSSC.2020.2987714
  27. [27] Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, Thomas Norrie, Nishant Patil, Sushma Prasad, Cliff Young, Zongwei Zhou, and David Patterson. 2021. Ten Lessons From Three Generations Shaped Google's TPUv4i: Industrial Product. In 2021 ACM/IEEE 48th Annual Intern...
  28. [28] Sheng-Chun Kao and Tushar Krishna. 2020. GAMMA: Automating the HW Mapping of DNN Models on Accelerators via Genetic Algorithm. In 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD). 1–9
  29. [29] Sheng-Chun Kao, Suvinay Subramanian, Gaurav Agrawal, Amir Yazdanbakhsh, and Tushar Krishna. 2021. FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks. https://doi.org/10.48550/ARXIV.2107.06419
  30. [30] Hyunjoon Kim, Taegeun Yoo, Tony Tae-Hyoung Kim, and Bongjin Kim. 2021. Colonnade: A Reconfigurable SRAM-Based Digital Bit-Serial Compute-In-Memory Macro for Processing Neural Networks. IEEE Journal of Solid-State Circuits 56, 7 (2021), 2221–2233. https://doi.org/10.1109/JSSC.2021.3061508
  31. [31] Fredrik Kjolstad, Stephen Chou, David Lugato, Shoaib Kamil, and Saman Amarasinghe. 2017. Taco: A tool to generate tensor algebra kernels. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). 943–948. https://doi.org/10.1109/ASE.2017.8115709
  32. [32] Hyoukjun Kwon, Prasanth Chatarasi, Vivek Sarkar, Tushar Krishna, Michael Pellauer, and Angshuman Parashar. 2020. MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings. IEEE Micro 40, 3 (2020), 20–29. https://doi.org/10.1109/MM.2020.2985963
  33. [33] Marco Laumanns, Lothar Thiele, Kalyanmoy Deb, and Eckart Zitzler. 2002. Combining Convergence and Diversity in Evolutionary Multiobjective Optimization. Evolutionary Computation 10, 3 (2002), 263–282. https://doi.org/10.1162/106365602760234108
  34. [34] L. Mei, K. Goetschalckx, A. Symons, and M. Verhelst. 2023. DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators through Analytical Modeling. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE Computer Society, Los Alamitos, CA, USA, 570–583. https://doi.org/10.1109/HPCA56546.20...
  35. [35] Linyan Mei, Pouya Houshmand, Vikram Jain, Sebastian Giraldo, and Marian Verhelst. 2021. ZigZag: Enlarging Joint Architecture-Mapping Design Space Exploration for DNN Accelerators. IEEE Trans. Comput. 70, 8 (2021), 1160–1174. https://doi.org/10.1109/TC.2021.3059962
  36. [36] Microsoft. 2025. Develop AI applications for Copilot+ PCs. https://learn.microsoft.com/en-us/windows/ai/npu-devices/. Accessed: 2026-04-06
  37. [37] Nandeeka Nayak, Toluwanimi O. Odemuyiwa, Shubham Ugare, Christopher W. Fletcher, Michael Pellauer, and Joel S. Emer. 2023. TeAAL: A Declarative Framework for Modeling Sparse Tensor Accelerators. arXiv e-prints, Article arXiv:2304.07931 (April 2023). https://doi.org/10.48550/arXiv.2304.07931 arXiv:2304.07931 [cs.AR]
  38. [38] Nandeeka Nayak, Xinrui Wu, Toluwanimi O. Odemuyiwa, Michael Pellauer, Joel S. Emer, and Christopher W. Fletcher. 2024. FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design. In Proceedings of the 57th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '24). Association for Computing Machinery, New York, NY, USA
  39. [39] Toluwanimi O. Odemuyiwa, Joel S. Emer, and John D. Owens. 2024. The EDGE Language: Extended General Einsums for Graph Algorithms. arXiv e-prints, Article arXiv:2404.11591 (April 2024). https://doi.org/10.48550/arXiv.2404.11591 arXiv:2404.11591 [cs.DS]
  40. [40] Angshuman Parashar, Priyanka Raina, Yakun Sophia Shao, Yu-Hsin Chen, Victor A. Ying, Anurag Mukkara, Rangharajan Venkatesan, Brucek Khailany, Stephen W. Keckler, and Joel Emer. 2019. Timeloop: A Systematic Approach to DNN Accelerator Evaluation. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 304–315. https:...
  41. [41] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 14–26. https://doi.org/10.1...
  42. [42] Kyle Shiflett, Avinash Karanth, Razvan Bunescu, and Ahmed Louri. 2021. Albireo: energy-efficient acceleration of convolutional neural networks via silicon photonics. In Proceedings of the 48th Annual International Symposium on Computer Architecture (Virtual Event, Spain) (ISCA '21). IEEE Press, 860–873. https://doi.org/10.1109/ISCA52012.2021.00072
  43. [43] Mahmut E. Sinangil, Burak Erbagci, Rawan Naous, Kerem Akarvardar, Dar Sun, Win-San Khwa, Hung-Jen Liao, Yih Wang, and Jonathan Chang. 2021. A 7-nm Compute-in-Memory SRAM Macro Supporting Multi-Bit Input, Weight and Output and Achieving 351 TOPS/W and 372.4 GOPS. IEEE Journal of Solid-State Circuits 56, 1 (2021), 188–198. https://doi.org/10.1109/JSSC.2020.3031290
  44. [44] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2020. Efficient Processing of Deep Neural Networks. Synthesis Lectures on Computer Architecture 15, 2 (2020), 1–341. https://doi.org/10.2200/S01004ED1V01Y202004CAC050
  45. [45] Luc Waeijen, Savvas Sioutas, Maurice Peemen, Menno Lindwer, and Henk Corporaal. 2021. ConvFusion: A Model for Layer Fusion in Convolutional Neural Networks. IEEE Access 9 (2021), 168245–168267. https://doi.org/10.1109/ACCESS.2021.3134930
  46. [46] Weier Wan, Rajkumar Kubendran, S. Burc Eryilmaz, Wenqiang Zhang, Yan Liao, Dabin Wu, Stephen Deiss, Bin Gao, Priyanka Raina, Siddharth Joshi, Huaqiang Wu, Gert Cauwenberghs, and H.-S. Philip Wong. 2020. 33.1 A 74 TMACS/W CMOS-RRAM Neurosynaptic Core with Dynamically Reconfigurable Dataflow and In-situ Transposable Weights for Probabilistic Graphical Model...
  47. [47] Weier Wan, Rajkumar Kubendran, Clemens Schaefer, Sukru Burc Eryilmaz, Wenqiang Zhang, Dabin Wu, Stephen Deiss, Priyanka Raina, He Qian, Bin Gao, Siddharth Joshi, Huaqiang Wu, H.-S. Philip Wong, and Gert Cauwenberghs. 2022. A compute-in-memory chip based on resistive random-access memory. Nature 608, 7923 (Aug. 2022), 504–512. https://doi.org/10.1038/s415...
  48. [48] Hechen Wang, Renzhi Liu, Richard Dorrance, Deepak Dasalukunte, Dan Lake, and Brent Carlton. 2023. A Charge Domain SRAM Compute-in-Memory Macro With C-2C Ladder-Based 8-Bit MAC Unit in 22-nm FinFET Process for Edge Inference. IEEE Journal of Solid-State Circuits 58, 4 (2023), 1037–1050. https://doi.org/10.1109/JSSC.2022.3232601
  49. [49] Hechen Wang, Renzhi Liu, Richard Dorrance, Deepak Dasalukunte, Xiaosen Liu, Dan Lake, Brent Carlton, and May Wu. 2022. A 32.2 TOPS/W SRAM Compute-in-Memory Macro Employing a Linear 8-bit C-2C Ladder for Charge Domain Computation in 22nm for Edge Inference. In 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits). 36–37. https:...
  50. [50] Yuchen Wei, Jingwei Cai, Zuotong Wu, Sen Peng, and Kaisheng Ma. 2023. SET Artifacts. https://doi.org/10.5281/zenodo.7751328
  51. [51] Yannan Nellie Wu, Joel S. Emer, and Vivienne Sze. 2019. Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs. In 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 1–8. https://doi.org/10.1109/ICCAD45719.2019.8942149
  52. [52] Linxuan Zhang, J. Nelson Amaral, and Di Niu. 2025. TransFusion: End-to-End Transformer Acceleration via Graph Fusion and Pipelining. Association for Computing Machinery, New York, NY, USA, 1491–1504. https://doi.org/10.1145/3725843.3756105
  53. [53] Size Zheng, Siyuan Chen, Siyuan Gao, Liancheng Jia, Guangyu Sun, Runsheng Wang, and Yun Liang. 2023. TileFlow: A Framework for Modeling Fusion Dataflow via Tree-Based Analysis. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (Toronto, ON, Canada) (...
  54. [54] Zhizhen Zhong, Mingran Yang, Jay Lang, Christian Williams, Liam Kronman, Alexander Sludds, Homa Esfahanizadeh, Dirk Englund, and Manya Ghobadi. 2023. Lightning: A Reconfigurable Photonic-Electronic SmartNIC for Fast and Energy-Efficient Inference. In Proceedings of the ACM SIGCOMM 2023 Conference (New York, NY, USA) (ACM SIGCOMM '23). Association for Comput...