pith. machine review for the scientific record.

arxiv: 2602.15166 · v2 · submitted 2026-02-16 · 💻 cs.AR


Fast and Fusiest: An Optimal Fusion-Aware Mapper for Accelerator Design


Pith reviewed 2026-05-15 21:29 UTC · model grok-4.3

classification 💻 cs.AR
keywords: tensor accelerator mapping, fusion optimization, energy-delay product, automated mapper, pruning, tensor algebra, data movement scheduling

The pith

FFM prunes partial mappings that can never appear in an optimal fused schedule, making it feasible to search the full fused mapspace for tensor accelerators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Fast and Fusiest Mapper (FFM) as a way to automatically discover the best way to schedule data movement and operations on tensor algebra accelerators while keeping intermediate results on-chip through fusion. Prior automated mappers could not explore the full space of fused mappings because the number of candidates grows exponentially with the number of computation steps, forcing designers to rely on hand-tuned fusion choices. FFM eliminates large subsets of mappings shown never to appear in any optimal schedule, then assembles the remaining partial mappings into complete fused schedules. The resulting mappings deliver up to 1.8 times lower energy-delay product than the hand-optimized TransFusion accelerator and run more than 10,000 times faster than TileFlow or SET.
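The headline metric, energy-delay product (EDP), is simply energy multiplied by delay, so it rewards designs that are both efficient and fast. A minimal illustration of what a 1.8× EDP reduction means, using made-up numbers that are not taken from the paper:

```python
# EDP = energy * delay. All numbers below are illustrative placeholders,
# not data from the paper.
baseline = {"energy_j": 0.90, "delay_s": 0.0200}   # e.g., a hand-tuned fusion choice
candidate = {"energy_j": 0.72, "delay_s": 0.0139}  # e.g., a mapper-found fused mapping

def edp(d):
    return d["energy_j"] * d["delay_s"]

reduction = edp(baseline) / edp(candidate)
print(f"EDP reduction: {reduction:.2f}x")  # about 1.8x for these placeholder numbers
```

Note that a mapping can lower EDP by cutting energy, latency, or both; fusion typically helps both at once by avoiding off-chip traffic.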

Core claim

FFM shrinks the search space by pruning partial mappings proven never to be part of any optimal mapping, which at once eliminates every suboptimal mapping containing them, then joins the surviving partial mappings to construct optimal fused mappings.
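The abstract does not spell out the pruning rules, but the prune-then-join structure can be sketched. The toy below is not the paper's algorithm: it assumes an additive cost model and a shared on-chip buffer budget, under which a partial mapping is safely prunable whenever another partial for the same step is no worse in both cost and buffer need:

```python
from itertools import product

def prune_dominated(partials):
    """Drop any (cost, buffer_need) partial that another partial for the
    same step dominates: no worse on both axes, strictly better on one.
    Under an additive cost model with a shared buffer budget, swapping a
    dominated partial for its dominator never hurts any complete mapping,
    so no optimal mapping is lost."""
    return [p for p in partials
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in partials)]

def join(steps, buffer_capacity):
    """Enumerate one surviving partial per step; keep combinations whose
    total buffer need fits on chip; return the minimum-cost mapping."""
    best = None
    for combo in product(*steps):
        if sum(b for _, b in combo) > buffer_capacity:
            continue
        cost = sum(c for c, _ in combo)
        if best is None or cost < best[0]:
            best = (cost, combo)
    return best

# Two computation steps, each with candidate partial mappings (cost, buffer).
step1 = prune_dominated([(10, 4), (12, 4), (8, 6)])   # (12, 4) is pruned
step2 = prune_dominated([(5, 2), (7, 1)])             # neither dominates
best = join([step1, step2], buffer_capacity=7)
print(best)  # (15, ((10, 4), (5, 2)))
```

The payoff is the same as in the paper's claim: pruning (12, 4) removes every complete mapping containing it in one step, before the join ever enumerates them.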

What carries the argument

FFM's pruning rules on partial mappings, which remove any partial schedule proven never to participate in a globally optimal fused mapping before the surviving pieces are joined into complete schedules.

Load-bearing premise

The pruning rules correctly eliminate only those partial mappings that can never belong to an optimal fused schedule for the given workload shapes and fusion semantics.

What would settle it

A workload in which an optimal fused mapping is known to contain a partial mapping that FFM's rules prune would falsify the pruning correctness claim.
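Under a toy model of this kind (assumed additive costs and a shared buffer budget, not the paper's cost model), that falsification criterion can be checked mechanically: the optimum over the pruned space must equal the optimum over the full space on every instance, and a single mismatch would refute the pruning rule:

```python
import random
from itertools import product

def optimum(steps, capacity):
    """Best total cost over one (cost, buffer_need) partial per step,
    subject to combined buffer needs fitting in `capacity`; None if
    no combination is feasible."""
    best = None
    for combo in product(*steps):
        if sum(b for _, b in combo) <= capacity:
            cost = sum(c for c, _ in combo)
            best = cost if best is None else min(best, cost)
    return best

def prune_dominated(partials):
    # Prune a partial if another partial is no worse on both cost and
    # buffer need (and strictly better on at least one).
    return [p for p in partials
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in partials)]

random.seed(0)
for _ in range(200):
    steps = [[(random.randint(1, 20), random.randint(1, 5)) for _ in range(6)]
             for _ in range(3)]
    pruned = [prune_dominated(s) for s in steps]
    # Any instance where these differ would falsify this pruning rule.
    assert optimum(pruned, capacity=9) == optimum(steps, capacity=9)
print("no counterexample found in 200 random instances")
```

Random testing of this sort can only fail to find a counterexample; the referee's request below for a proof sketch is the stronger form of the same check.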

Figures

Figures reproduced from arXiv: 2602.15166 by Joel S. Emer, Michael Gilbert, Tanner Andrulis, Vivienne Sze.

Figure 1: An analogy of pmappings as puzzle pieces. (a) Tabs
Figure 2: (a) Pseudocode of a mapping for two matrix-vector
Figure 4: An example constrained optimization and (b) ex
Figure 5: (a) ReservationTree for the mapping in Fig.
Figure 7: Overview of the Fast and Fusiest Mapper (FFM).
Figure 8: Convergence speeds of baselines relative to FFM.
Figure 9: FFM is fast and per-Einsum runtime remains flat
Figure 12: (a) FFM reduces off-chip memory energy compared
Figure 13: FFM strategically fuses Einsums with low com
Figure 11: FFM reduces EDP, latency, and energy per token
read the original abstract

A low-latency and energy-efficient tensor algebra accelerator design must optimize how data movement and operations are scheduled (i.e., mapped) in the accelerator architecture. A key mapping optimization is fusion, meaning holding data on-chip between computation steps in the workload, which has been shown to reduce energy and latency by reducing expensive off-chip data movement. However, the optimal fusion choice depends on the workload and workload shape, and a mapper, which searches for the optimal mapping, can improve energy and latency significantly. However, prior mappers cannot find optimal mappings with fusion (i.e., fused mappings) in a feasible runtime because the number of fused mappings to search increases exponentially with the number of computation steps in the workload. In this paper, we introduce the Fast and Fusiest Mapper (FFM), a mapper to quickly find optimal mappings in a comprehensive fused mapspace for tensor algebra workloads. FFM shrinks the search space by pruning subsets of mappings (i.e., partial mappings) that are shown to never be a part of optimal mappings, quickly eliminating all suboptimal mappings containing those partial mappings. Then FFM joins partial mappings to construct optimal fused mappings. Using FFM, we demonstrate an energy-delay-product (EDP) reduction by up to $1.8\times$ compared to TransFusion, a state-of-the-art accelerator with hand-optimized fusion. Moreover, we show that FFM finds mappings orders of magnitude faster ($>10,000\times$) than prior automated mappers TileFlow and SET, and given the same runtime, reduces EDP by $>2\times$.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Fast and Fusiest Mapper (FFM), which prunes partial mappings shown to never appear in optimal fused mappings for tensor algebra workloads on accelerators, then joins the surviving partial mappings to construct complete optimal fused mappings. It claims this enables exhaustive search in a comprehensive fused mapspace, yielding up to 1.8× EDP reduction versus the hand-optimized TransFusion accelerator and >10,000× faster mapping than TileFlow and SET (with >2× EDP improvement at equal runtime).

Significance. If the pruning predicate is sound, FFM would make optimal fusion-aware mapping tractable for workloads with many computation steps, directly addressing the exponential mapspace growth that limits prior automated mappers. This could improve automated design-space exploration for energy-efficient tensor accelerators beyond current hand-tuned or heuristic approaches.

major comments (3)
  1. [§3] §3 (Pruning Rules): The central claim that the pruning rules eliminate only mappings that 'can never be a part of optimal mappings' is presented without a formal invariant, proof sketch, or exhaustive enumeration on representative workloads; the rules depend on unstated assumptions about fusion semantics, dataflow reuse, and cost-model monotonicity that must hold for arbitrary tensor-algebra patterns.
  2. [§5] §5 (Experimental Results): Concrete claims of 1.8× EDP reduction and >10,000× speedup are reported, yet the manuscript supplies no details on workload selection, accelerator parameters, cost-model implementation, or any validation that the pruned mapspace still contains the true optimum; this leaves the performance numbers unsupported and non-reproducible from the given text.
  3. [§4.2] §4.2 (Joining Step): The joining procedure that reconstructs complete fused mappings from pruned partials is described at a high level but does not specify how it guarantees completeness (i.e., that every optimal fused mapping is recovered) once pruning has occurred.
minor comments (2)
  1. [Abstract] Abstract: The title uses 'Fusiest' but the term is never defined or motivated in the body; a brief clarification of the intended meaning would improve readability.
  2. [Figure 2] Figure 2: The diagram of the pruning and joining process lacks labels on the partial-mapping nodes and does not indicate which pruning rules are applied at each step.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We address each major comment below and will revise the manuscript to strengthen the formal arguments, add missing experimental details, and clarify the joining procedure.

read point-by-point responses
  1. Referee: [§3] §3 (Pruning Rules): The central claim that the pruning rules eliminate only mappings that 'can never be a part of optimal mappings' is presented without a formal invariant, proof sketch, or exhaustive enumeration on representative workloads; the rules depend on unstated assumptions about fusion semantics, dataflow reuse, and cost-model monotonicity that must hold for arbitrary tensor-algebra patterns.

    Authors: We agree that the current presentation lacks a formal invariant and proof sketch. In the revised manuscript we will add to §3 a proof sketch based on cost-model monotonicity (any increase in partial cost cannot be offset by later fusion choices) together with the standard fusion semantics that fused steps must share on-chip buffers without intermediate DRAM writes. We will also report exhaustive enumeration results on representative small workloads (chains of 3–5 GEMMs and CONVs) confirming that no optimal mapping is eliminated. The assumptions on fusion semantics, dataflow reuse, and monotonicity will be stated explicitly at the start of the section. revision: yes

  2. Referee: [§5] §5 (Experimental Results): Concrete claims of 1.8× EDP reduction and >10,000× speedup are reported, yet the manuscript supplies no details on workload selection, accelerator parameters, cost-model implementation, or any validation that the pruned mapspace still contains the true optimum; this leaves the performance numbers unsupported and non-reproducible from the given text.

    Authors: We accept that the experimental section is insufficiently detailed for reproducibility. The revised §5 will enumerate the exact workloads (MLPerf-derived GEMM/CONV shapes plus synthetic chains), accelerator parameters (PE array size, buffer capacities, NoC bandwidth), and cost-model implementation (Timeloop/Accelergy energy and latency tables). We will also add a validation subsection that compares FFM against exhaustive search on reduced mapspaces to confirm the pruned space retains the true optimum, thereby supporting the reported 1.8× EDP and >10,000× speedup figures. revision: yes

  3. Referee: [§4.2] §4.2 (Joining Step): The joining procedure that reconstructs complete fused mappings from pruned partials is described at a high level but does not specify how it guarantees completeness (i.e., that every optimal fused mapping is recovered) once pruning has occurred.

    Authors: We will expand §4.2 with pseudocode for the join operation and an explicit completeness argument: because the pruning rules of §3 remove only partial mappings that cannot participate in any optimal solution, every optimal complete mapping is composed exclusively of surviving partials; the join enumerates all compatible combinations of those partials and therefore recovers every optimal fused mapping. A short inductive argument will be included to show that no optimal mapping is lost. revision: yes
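The completeness argument in response 3 can be made concrete with a sketch of the join step. Everything here is hypothetical (the compatibility predicate, the cost model, the tile names): consecutive steps are assumed compatible only when they tile the shared intermediate tensor identically, so it can stay on-chip:

```python
def join(steps, compatible):
    """steps: per computation step, a list of (tile, cost) surviving partials.
    compatible(p, q): whether partial q for step i+1 can follow p for step i.
    Enumerates every compatible combination of survivors, so if pruning
    removed only partials that appear in no optimal mapping, the optimal
    complete mapping is guaranteed to be among those enumerated."""
    best = None

    def extend(i, prev, cost, chosen):
        nonlocal best
        if i == len(steps):
            if best is None or cost < best[0]:
                best = (cost, chosen)
            return
        for p in steps[i]:
            if prev is None or compatible(prev, p):
                extend(i + 1, p, cost + p[1], chosen + [p[0]])

    extend(0, None, 0, [])
    return best

# Hypothetical example: fusing two Einsums requires agreeing on the tile
# size of the intermediate tensor they share.
steps = [[("tile8", 3), ("tile4", 2)], [("tile8", 1), ("tile4", 5)]]
same_tiling = lambda p, q: p[0] == q[0]
print(join(steps, same_tiling))  # (4, ['tile8', 'tile8'])
```

Note the structure of the argument: soundness belongs to the pruning rules, while the join only has to be exhaustive over survivors; the inductive step the authors promise would tie the two together.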

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's core mechanism prunes partial mappings claimed to never appear in optimal fused mappings before joining survivors to form complete mappings. This is presented as an analysis of optimality invariants for tensor algebra operations and the accelerator cost model, with results validated empirically against external baselines (TransFusion, TileFlow, SET). No equations or steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the pruning predicate is asserted from workload properties rather than derived from the target result itself. The derivation remains self-contained and externally falsifiable via the reported EDP and runtime comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the pruning correctness is treated as an unstated domain assumption whose validity cannot be audited from the given text.

pith-pipeline@v0.9.0 · 5593 in / 1019 out tokens · 23061 ms · 2026-05-15T21:29:24.399590+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 3 internal anchors

  1. [1] Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-layer CNN accelerators. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–12. https://doi.org/10.1109/MICRO.2016.7783725
  2. [2] Tanner Andrulis. [n. d.]. HWComponents. https://github.com/Accelergy-Project/hwcomponents
  3. [3] Tanner Andrulis, Joel S. Emer, and Vivienne Sze. 2023. RAELLA: Reforming the Arithmetic for Efficient, Low-Resolution, and Low-Loss Analog PIM: No Retraining Required!. In Proceedings of the 50th Annual International Symposium on Computer Architecture (Orlando, FL, USA) (ISCA '23). Association for Computing Machinery, New York, NY, USA, Article 27, 16 pa...
  4. [4] Tanner Andrulis, Joel S. Emer, and Vivienne Sze. 2024. CiMLoop: A Flexible, Accurate, and Fast Compute-In-Memory Modeling Tool. In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10–23. https://doi.org/10.1109/ISPASS61541.2024.00012
  5. [5] Rajeev Balasubramonian, Andrew B. Kahng, Naveen Muralimanohar, Ali Shafiee, and Vaishnav Srinivas. 2017. CACTI 7: New Tools for Interconnect Exploration in Innovative Off-Chip Memories. ACM Trans. Archit. Code Optim. 14, 2, Article 14 (June 2017), 25 pages. https://doi.org/10.1145/3085572
  6. [6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott ...
  7. [7] Jingwei Cai, Yuchen Wei, Zuotong Wu, Sen Peng, and Kaisheng Ma. 2023. Inter-Layer Scheduling Space Definition and Exploration for Tiled Accelerators. In Proceedings of the 50th Annual International Symposium on Computer Architecture (Orlando, FL, USA) (ISCA '23). Association for Computing Machinery, New York, NY, USA, Article 13, 17 pages. https://doi.org...
  8. [8] Xuyi Cai, Ying Wang, and Lei Zhang. 2022. Optimus: An Operator Fusion Framework for Deep Neural Networks. ACM Trans. Embed. Comput. Syst. 22, 1, Article 1 (Oct 2022), 26 pages. https://doi.org/10.1145/3520142
  9. [9] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsba...
  10. [10] https://www.usenix.org/conference/osdi18/presentation/chen
  11. [11] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 367–379. https://doi.org/10.1109/ISCA.2016.40
  12. [12] Tri Dao. 2024. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=mZn2Xyh9Ec
  13. [13] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv e-prints, Article arXiv:2205.14135 (May 2022). https://doi.org/10.48550/arXiv.2205.14135 arXiv:2205.14135 [cs.LG]
  14. [14] A. Einstein. 1916. The Foundation of the General Theory of Relativity. Annalen der Physik 354, 7 (1916), 769–822. https://doi.org/10.1002/andp.19163540702
  15. [15] Michael Gilbert. 2023. LoopTree: Enabling Systematic and Flexible Exploration of Fused-layer Dataflow Accelerators. PhD thesis. Massachusetts Institute of Technology, Cambridge, MA
  16. [16] Michael Gilbert, Tanner Andrulis, Vivienne Sze, and Joel S. Emer. 2026. The Turbo-Charged Mapper: Fast and Optimal Mapping for Accelerator Modeling and Evaluation. arXiv:2602.15172 [cs.AR] https://arxiv.org/abs/2602.15172
  17. [17] Michael Gilbert, Yannan Nellie Wu, Joel S. Emer, and Vivienne Sze. 2024. LoopTree: Exploring the Fused-Layer Dataflow Accelerator Design Space. IEEE Transactions on Circuits and Systems for Artificial Intelligence 1, 1 (2024), 97–
  18. [18] https://doi.org/10.1109/TCASAI.2024.3461716
  19. [19] Koen Goetschalckx, Fengfeng Wu, and Marian Verhelst. 2023. DepFiN: A 12-nm Depth-First, High-Resolution CNN Processor for IO-Efficient Inference. IEEE Journal of Solid-State Circuits 58, 5 (2023), 1425–1435. https://doi.org/10.1109/JSSC.2022.3210591
  20. [20] Google. 2022. XLA: Optimizing Compiler for Machine Learning. https://www.tensorflow.org/xla
  21. [21] Kartik Hegde, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, and Christopher W. Fletcher. 2019. ExTensor: An Accelerator for Sparse Tensor Algebra. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (Columbus, OH, USA) (MICRO '52). Association for Computing Machinery, New ...
  22. [22] Kartik Hegde, Po-An Tsai, Sitao Huang, Vikas Chandra, Angshuman Parashar, and Christopher W. Fletcher. 2021. Mind Mappings: Enabling Efficient Algorithm-Accelerator Mapping Space Search. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '21). IEEE
  23. [23] Mark Horeni, Pooria Taheri, Po-An Tsai, Angshuman Parashar, Joel Emer, and Siddharth Joshi. 2022. Ruby: Improving Hardware Efficiency for Tensor Algebra Accelerators Through Imperfect Factorization. In 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 254–266. https://doi.org/10.1109/ISPASS55109.2022.00039
  24. [24] Qijing Huang, Minwoo Kang, Grace Dinh, Thomas Norell, Aravind Kalaiah, James Demmel, John Wawrzynek, and Yakun Sophia Shao. 2021. CoSA: Scheduling by Constrained Optimization for Spatial Accelerators. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). 554–566. https://doi.org/10.1109/ISCA52012.2021.00050
  25. [25] Hongyang Jia, Hossein Valavi, Yinqi Tang, Jintao Zhang, and Naveen Verma
  26. [26] A Programmable Heterogeneous Microprocessor Based on Bit-Scalable In-Memory Computing. IEEE Journal of Solid-State Circuits 55, 9 (2020), 2609–2621. https://doi.org/10.1109/JSSC.2020.2987714
  27. [27] Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, Thomas Norrie, Nishant Patil, Sushma Prasad, Cliff Young, Zongwei Zhou, and David Patterson. 2021. Ten Lessons From Three Generations Shaped Google's TPUv4i: Industrial Product. In 2021 ACM/IEEE 48th Annual Intern...
  28. [28] Sheng-Chun Kao and Tushar Krishna. 2020. GAMMA: Automating the HW Mapping of DNN Models on Accelerators via Genetic Algorithm. In 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD). 1–9
  29. [29] Sheng-Chun Kao, Suvinay Subramanian, Gaurav Agrawal, Amir Yazdanbakhsh, and Tushar Krishna. 2021. FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks. https://doi.org/10.48550/ARXIV.2107.06419
  30. [30] Hyunjoon Kim, Taegeun Yoo, Tony Tae-Hyoung Kim, and Bongjin Kim. 2021. Colonnade: A Reconfigurable SRAM-Based Digital Bit-Serial Compute-In-Memory Macro for Processing Neural Networks. IEEE Journal of Solid-State Circuits 56, 7 (2021), 2221–2233. https://doi.org/10.1109/JSSC.2021.3061508
  31. [31] Fredrik Kjolstad, Stephen Chou, David Lugato, Shoaib Kamil, and Saman Amarasinghe. 2017. Taco: A tool to generate tensor algebra kernels. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). 943–948. https://doi.org/10.1109/ASE.2017.8115709
  32. [32] Hyoukjun Kwon, Prasanth Chatarasi, Vivek Sarkar, Tushar Krishna, Michael Pellauer, and Angshuman Parashar. 2020. MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings. IEEE Micro 40, 3 (2020), 20–29. https://doi.org/10.1109/MM.2020.2985963
  33. [33] Marco Laumanns, Lothar Thiele, Kalyanmoy Deb, and Eckart Zitzler. 2002. Combining Convergence and Diversity in Evolutionary Multiobjective Optimization. Evolutionary Computation 10, 3 (2002), 263–282. https://doi.org/10.1162/106365602760234108
  34. [34] L. Mei, K. Goetschalckx, A. Symons, and M. Verhelst. 2023. DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators through Analytical Modeling. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE Computer Society, Los Alamitos, CA, USA, 570–583. https://doi.org/10.1109/HPCA56546.20...
  35. [35] Linyan Mei, Pouya Houshmand, Vikram Jain, Sebastian Giraldo, and Marian Verhelst. 2021. ZigZag: Enlarging Joint Architecture-Mapping Design Space Exploration for DNN Accelerators. IEEE Trans. Comput. 70, 8 (2021), 1160–1174. https://doi.org/10.1109/TC.2021.3059962
  36. [36] Microsoft. 2025. Develop AI applications for Copilot+ PCs. https://learn.microsoft.com/en-us/windows/ai/npu-devices/. Accessed: 2026-04-06
  37. [37] Nandeeka Nayak, Toluwanimi O. Odemuyiwa, Shubham Ugare, Christopher W. Fletcher, Michael Pellauer, and Joel S. Emer. 2023. TeAAL: A Declarative Framework for Modeling Sparse Tensor Accelerators. arXiv e-prints, Article arXiv:2304.07931 (April 2023). https://doi.org/10.48550/arXiv.2304.07931 arXiv:2304.07931 [cs.AR]
  38. [38] Nandeeka Nayak, Xinrui Wu, Toluwanimi O. Odemuyiwa, Michael Pellauer, Joel S. Emer, and Christopher W. Fletcher. 2024. FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design. In Proceedings of the 57th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '24). Association for Computing Machinery, New York, NY, USA
  39. [39] Toluwanimi O. Odemuyiwa, Joel S. Emer, and John D. Owens. 2024. The EDGE Language: Extended General Einsums for Graph Algorithms. arXiv e-prints, Article arXiv:2404.11591 (April 2024). https://doi.org/10.48550/arXiv.2404.11591 arXiv:2404.11591 [cs.DS]
  40. [40] Angshuman Parashar, Priyanka Raina, Yakun Sophia Shao, Yu-Hsin Chen, Victor A. Ying, Anurag Mukkara, Rangharajan Venkatesan, Brucek Khailany, Stephen W. Keckler, and Joel Emer. 2019. Timeloop: A Systematic Approach to DNN Accelerator Evaluation. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 304–315. https:...
  41. [41] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 14–26. https://doi.org/10.1...
  42. [42] Kyle Shiflett, Avinash Karanth, Razvan Bunescu, and Ahmed Louri. 2021. Albireo: energy-efficient acceleration of convolutional neural networks via silicon photonics. In Proceedings of the 48th Annual International Symposium on Computer Architecture (Virtual Event, Spain) (ISCA '21). IEEE Press, 860–873. https://doi.org/10.1109/ISCA52012.2021.00072
  43. [43] Mahmut E. Sinangil, Burak Erbagci, Rawan Naous, Kerem Akarvardar, Dar Sun, Win-San Khwa, Hung-Jen Liao, Yih Wang, and Jonathan Chang. 2021. A 7-nm Compute-in-Memory SRAM Macro Supporting Multi-Bit Input, Weight and Output and Achieving 351 TOPS/W and 372.4 GOPS. IEEE Journal of Solid-State Circuits 56, 1 (2021), 188–198. https://doi.org/10.1109/JSSC.2020.3031290
  44. [44] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2020. Efficient Processing of Deep Neural Networks. Synthesis Lectures on Computer Architecture 15, 2 (2020), 1–341. https://doi.org/10.2200/S01004ED1V01Y202004CAC050
  45. [45] Luc Waeijen, Savvas Sioutas, Maurice Peemen, Menno Lindwer, and Henk Corporaal. 2021. ConvFusion: A Model for Layer Fusion in Convolutional Neural Networks. IEEE Access 9 (2021), 168245–168267. https://doi.org/10.1109/ACCESS.2021.3134930
  46. [46] Weier Wan, Rajkumar Kubendran, S. Burc Eryilmaz, Wenqiang Zhang, Yan Liao, Dabin Wu, Stephen Deiss, Bin Gao, Priyanka Raina, Siddharth Joshi, Huaqiang Wu, Gert Cauwenberghs, and H.-S. Philip Wong. 2020. 33.1 A 74 TMACS/W CMOS-RRAM Neurosynaptic Core with Dynamically Reconfigurable Dataflow and In-situ Transposable Weights for Probabilistic Graphical Model...
  47. [47] Weier Wan, Rajkumar Kubendran, Clemens Schaefer, Sukru Burc Eryilmaz, Wenqiang Zhang, Dabin Wu, Stephen Deiss, Priyanka Raina, He Qian, Bin Gao, Siddharth Joshi, Huaqiang Wu, H.-S. Philip Wong, and Gert Cauwenberghs. 2022. A compute-in-memory chip based on resistive random-access memory. Nature 608, 7923 (Aug. 2022), 504–512. https://doi.org/10.1038/s415...
  48. [48] Hechen Wang, Renzhi Liu, Richard Dorrance, Deepak Dasalukunte, Dan Lake, and Brent Carlton. 2023. A Charge Domain SRAM Compute-in-Memory Macro With C-2C Ladder-Based 8-Bit MAC Unit in 22-nm FinFET Process for Edge Inference. IEEE Journal of Solid-State Circuits 58, 4 (2023), 1037–1050. https://doi.org/10.1109/JSSC.2022.3232601
  49. [49] Hechen Wang, Renzhi Liu, Richard Dorrance, Deepak Dasalukunte, Xiaosen Liu, Dan Lake, Brent Carlton, and May Wu. 2022. A 32.2 TOPS/W SRAM Compute-in-Memory Macro Employing a Linear 8-bit C-2C Ladder for Charge Domain Computation in 22nm for Edge Inference. In 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits). 36–37. https:...
  50. [50] Yuchen Wei, Jingwei Cai, Zuotong Wu, Sen Peng, and Kaisheng Ma. 2023. SET Artifacts. https://doi.org/10.5281/zenodo.7751328
  51. [51] Yannan Nellie Wu, Joel S. Emer, and Vivienne Sze. 2019. Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs. In 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 1–8. https://doi.org/10.1109/ICCAD45719.2019.8942149
  52. [52] Linxuan Zhang, J. Nelson Amaral, and Di Niu. 2025. TransFusion: End-to-End Transformer Acceleration via Graph Fusion and Pipelining. Association for Computing Machinery, New York, NY, USA, 1491–1504. https://doi.org/10.1145/3725843.3756105
  53. [53] Size Zheng, Siyuan Chen, Siyuan Gao, Liancheng Jia, Guangyu Sun, Runsheng Wang, and Yun Liang. 2023. TileFlow: A Framework for Modeling Fusion Dataflow via Tree-Based Analysis. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (Toronto, ON, Canada) (...
  54. [54] Zhizhen Zhong, Mingran Yang, Jay Lang, Christian Williams, Liam Kronman, Alexander Sludds, Homa Esfahanizadeh, Dirk Englund, and Manya Ghobadi. 2023. Lightning: A Reconfigurable Photonic-Electronic SmartNIC for Fast and Energy-Efficient Inference. In Proceedings of the ACM SIGCOMM 2023 Conference (New York, NY, USA) (ACM SIGCOMM '23). Association for Comput...