Recognition: no theorem link
Fast and Fusiest: An Optimal Fusion-Aware Mapper for Accelerator Design
Pith reviewed 2026-05-15 21:29 UTC · model grok-4.3
The pith
FFM searches the full fused mapspace for tensor accelerators in feasible time by pruning partial mappings that cannot lead to optimal fusion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FFM shrinks the search space by pruning partial mappings proven never to be part of any optimal mapping, which eliminates at once every suboptimal mapping containing them, and then joins the surviving partial mappings to construct optimal fused mappings.
What carries the argument
The Fast and Fusiest Mapper (FFM) applies pruning rules to partial mappings, removing any partial schedule proven never to participate in a globally optimal fused mapping before the remaining pieces are joined into complete schedules.
Load-bearing premise
The pruning rules correctly eliminate only those partial mappings that can never belong to an optimal fused schedule for the given workload shapes and fusion semantics.
What would settle it
A workload in which an optimal fused mapping is known to contain a partial mapping that FFM's rules prune would falsify the pruning correctness claim.
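The falsification test above can be made concrete on a toy mapspace: enumerate all mappings exhaustively, apply a pruning rule, and check that the exhaustive optimum survives. The workload, costs, and pruning predicate below are illustrative stand-ins, not FFM's actual rules.

```python
# Hypothetical sketch of the falsification test on a toy mapspace.
# A "mapping" picks one partial choice per computation step; its cost is
# the sum of per-step partial costs (a stand-in for the real cost model).
from itertools import product

partial_costs = {
    "step0": {"a": 3, "b": 5},
    "step1": {"c": 2, "d": 1},
}

def total_cost(mapping):
    return sum(partial_costs[s][p] for s, p in mapping.items())

def prune(step, choice):
    # Toy pruning rule: drop any partial that is strictly dominated by
    # another partial for the same step (safe here because per-step costs
    # are independent and additive).
    return any(c < partial_costs[step][choice] for c in partial_costs[step].values())

# Exhaustive optimum over the full mapspace.
full = [dict(zip(partial_costs, combo))
        for combo in product(*[d.keys() for d in partial_costs.values()])]
best = min(full, key=total_cost)

# Pruned mapspace: keep only surviving partials, then join them.
survivors = {s: [p for p in d if not prune(s, p)] for s, d in partial_costs.items()}
pruned = [dict(zip(survivors, combo)) for combo in product(*survivors.values())]

# The correctness claim: the exhaustive optimum must survive pruning.
assert best in pruned, "pruning rule eliminated an optimal mapping"
```

A counterexample workload, as described above, would be one where this final assertion fails under FFM's real rules and cost model.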
Original abstract
A low-latency and energy-efficient tensor algebra accelerator design must optimize how data movement and operations are scheduled (i.e., mapped) in the accelerator architecture. A key mapping optimization is fusion, meaning holding data on-chip between computation steps in the workload, which has been shown to reduce energy and latency by reducing expensive off-chip data movement. However, the optimal fusion choice depends on the workload and workload shape, and a mapper, which searches for the optimal mapping, can improve energy and latency significantly. However, prior mappers cannot find optimal mappings with fusion (i.e., fused mappings) in a feasible runtime because the number of fused mappings to search increases exponentially with the number of computation steps in the workload. In this paper, we introduce the Fast and Fusiest Mapper (FFM), a mapper to quickly find optimal mappings in a comprehensive fused mapspace for tensor algebra workloads. FFM shrinks the search space by pruning subsets of mappings (i.e., partial mappings) that are shown to never be a part of optimal mappings, quickly eliminating all suboptimal mappings containing those partial mappings. Then FFM joins partial mappings to construct optimal fused mappings. Using FFM, we demonstrate an energy-delay-product (EDP) reduction by up to 1.8× compared to TransFusion, a state-of-the-art accelerator with hand-optimized fusion. Moreover, we show that FFM finds mappings orders of magnitude faster (>10,000×) than prior automated mappers TileFlow and SET, and given the same runtime, reduces EDP by >2×.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Fast and Fusiest Mapper (FFM), which prunes partial mappings shown to never appear in optimal fused mappings for tensor algebra workloads on accelerators, then joins the surviving partial mappings to construct complete optimal fused mappings. It claims this enables exhaustive search in a comprehensive fused mapspace, yielding up to 1.8× EDP reduction versus the hand-optimized TransFusion accelerator and >10,000× faster mapping than TileFlow and SET (with >2× EDP improvement at equal runtime).
Significance. If the pruning predicate is sound, FFM would make optimal fusion-aware mapping tractable for workloads with many computation steps, directly addressing the exponential mapspace growth that limits prior automated mappers. This could improve automated design-space exploration for energy-efficient tensor accelerators beyond current hand-tuned or heuristic approaches.
Major comments (3)
- [§3] §3 (Pruning Rules): The central claim that the pruning rules eliminate only mappings that 'can never be a part of optimal mappings' is presented without a formal invariant, proof sketch, or exhaustive enumeration on representative workloads; the rules depend on unstated assumptions about fusion semantics, dataflow reuse, and cost-model monotonicity that must hold for arbitrary tensor-algebra patterns.
- [§5] §5 (Experimental Results): Concrete claims of 1.8× EDP reduction and >10,000× speedup are reported, yet the manuscript supplies no details on workload selection, accelerator parameters, cost-model implementation, or any validation that the pruned mapspace still contains the true optimum; this leaves the performance numbers unsupported and non-reproducible from the given text.
- [§4.2] §4.2 (Joining Step): The joining procedure that reconstructs complete fused mappings from pruned partials is described at a high level but does not specify how it guarantees completeness (i.e., that every optimal fused mapping is recovered) once pruning has occurred.
Minor comments (2)
- [Abstract] Abstract: The title uses 'Fusiest' but the term is never defined or motivated in the body; a brief clarification of the intended meaning would improve readability.
- [Figure 2] Figure 2: The diagram of the pruning and joining process lacks labels on the partial-mapping nodes and does not indicate which pruning rules are applied at each step.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. We address each major comment below and will revise the manuscript to strengthen the formal arguments, add missing experimental details, and clarify the joining procedure.
Point-by-point responses
-
Referee: [§3] §3 (Pruning Rules): The central claim that the pruning rules eliminate only mappings that 'can never be a part of optimal mappings' is presented without a formal invariant, proof sketch, or exhaustive enumeration on representative workloads; the rules depend on unstated assumptions about fusion semantics, dataflow reuse, and cost-model monotonicity that must hold for arbitrary tensor-algebra patterns.
Authors: We agree that the current presentation lacks a formal invariant and proof sketch. In the revised manuscript we will add to §3 a proof sketch based on cost-model monotonicity (any increase in partial cost cannot be offset by later fusion choices) together with the standard fusion semantics that fused steps must share on-chip buffers without intermediate DRAM writes. We will also report exhaustive enumeration results on representative small workloads (chains of 3–5 GEMMs and CONVs) confirming that no optimal mapping is eliminated. The assumptions on fusion semantics, dataflow reuse, and monotonicity will be stated explicitly at the start of the section. revision: yes
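The monotonicity argument the authors promise to add can be sketched as a standard branch-and-bound safety property: if extending a partial mapping can only add cost, then any partial whose accumulated cost already meets a known complete mapping's cost can be discarded with its entire subtree. The workload names and costs below are hypothetical, not the paper's.

```python
# Minimal sketch of monotone-cost pruning: a partial mapping whose cost
# already matches or exceeds the best known complete mapping cannot be
# extended into a better one, so its whole subtree is pruned at once.
def branch_and_bound(steps, costs):
    """steps: ordered step names; costs[step][choice] -> non-negative cost."""
    best_cost, best_map = float("inf"), None
    stack = [({}, 0)]  # (partial mapping, accumulated cost)
    while stack:
        partial, acc = stack.pop()
        if acc >= best_cost:   # monotone costs => no extension can win
            continue           # prune the subtree rooted at this partial
        if len(partial) == len(steps):
            best_cost, best_map = acc, partial
            continue
        step = steps[len(partial)]
        for choice, c in costs[step].items():
            stack.append(({**partial, step: choice}, acc + c))
    return best_map, best_cost

# Toy two-step workload (hypothetical numbers).
costs = {"step0": {"a": 3, "b": 5}, "step1": {"c": 2, "d": 1}}
mapping, cost = branch_and_bound(list(costs), costs)
```

The safety of the prune rests entirely on the monotonicity premise: if some later fusion choice could reduce accumulated cost, the `acc >= best_cost` cut would no longer be sound, which is exactly the assumption the referee asks to be stated explicitly.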
-
Referee: [§5] §5 (Experimental Results): Concrete claims of 1.8× EDP reduction and >10,000× speedup are reported, yet the manuscript supplies no details on workload selection, accelerator parameters, cost-model implementation, or any validation that the pruned mapspace still contains the true optimum; this leaves the performance numbers unsupported and non-reproducible from the given text.
Authors: We accept that the experimental section is insufficiently detailed for reproducibility. The revised §5 will enumerate the exact workloads (MLPerf-derived GEMM/CONV shapes plus synthetic chains), accelerator parameters (PE array size, buffer capacities, NoC bandwidth), and cost-model implementation (Timeloop/Accelergy energy and latency tables). We will also add a validation subsection that compares FFM against exhaustive search on reduced mapspaces to confirm the pruned space retains the true optimum, thereby supporting the reported 1.8× EDP and >10,000× speedup figures. revision: yes
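For readers unfamiliar with the metric behind the 1.8× figure, EDP is simply energy multiplied by delay, so a mapping cannot win by trading one off excessively against the other. The numbers below are hypothetical illustrations, not the paper's measurements.

```python
# Energy-delay product (EDP): energy [J] x delay [s].
def edp(energy_joules, delay_seconds):
    return energy_joules * delay_seconds

baseline = edp(2.0e-3, 1.0e-3)   # assumed hand-optimized fused mapping
ffm      = edp(1.5e-3, 0.74e-3)  # assumed FFM-found mapping
reduction = baseline / ffm       # ~1.8x, the same scale as reported vs TransFusion
```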
-
Referee: [§4.2] §4.2 (Joining Step): The joining procedure that reconstructs complete fused mappings from pruned partials is described at a high level but does not specify how it guarantees completeness (i.e., that every optimal fused mapping is recovered) once pruning has occurred.
Authors: We will expand §4.2 with pseudocode for the join operation and an explicit completeness argument: because the pruning rules of §3 remove only partial mappings that cannot participate in any optimal solution, every optimal complete mapping is composed exclusively of surviving partials; the join enumerates all compatible combinations of those partials and therefore recovers every optimal fused mapping. A short inductive argument will be included to show that no optimal mapping is lost. revision: yes
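The promised join pseudocode can be sketched as follows. The compatibility check (fused steps sharing one on-chip buffer) is an illustrative stand-in for FFM's actual constraint; all names and numbers are hypothetical.

```python
# Hypothetical sketch of the join step: enumerate all compatible
# combinations of surviving partials and keep the cheapest complete mapping.
from itertools import product

BUFFER_BYTES = 1024  # assumed on-chip buffer shared by fused steps

# Surviving partials per step: (name, cost, on-chip bytes reserved).
survivors = {
    "step0": [("a", 3, 600), ("b", 4, 200)],
    "step1": [("c", 2, 500), ("d", 5, 100)],
}

def compatible(combo):
    # Fused steps must fit in the shared buffer simultaneously.
    return sum(reserved for _, _, reserved in combo) <= BUFFER_BYTES

def join(survivors):
    complete = [c for c in product(*survivors.values()) if compatible(c)]
    return min(complete, key=lambda c: sum(cost for _, cost, _ in c))

best = join(survivors)  # each element: (choice, cost, bytes)
```

Note why enumeration (rather than picking each step's cheapest partial) is needed: here the per-step winners "a" and "c" are jointly incompatible (600 + 500 bytes exceeds the buffer), so the optimum is a different combination. Completeness then reduces to the §3 claim that no partial of any optimal mapping was pruned.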
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper's core mechanism prunes partial mappings claimed to never appear in optimal fused mappings before joining survivors to form complete mappings. This is presented as an analysis of optimality invariants for tensor algebra operations and the accelerator cost model, with results validated empirically against external baselines (TransFusion, TileFlow, SET). No equations or steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the pruning predicate is asserted from workload properties rather than derived from the target result itself. The derivation remains self-contained and externally falsifiable via the reported EDP and runtime comparisons.
Reference graph
Works this paper leans on
- [1] Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-layer CNN accelerators. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–12. https://doi.org/10.1109/MICRO.2016.7783725
- [2] Tanner Andrulis. [n. d.]. HWComponents. https://github.com/Accelergy-Project/hwcomponents
- [3] Tanner Andrulis, Joel S. Emer, and Vivienne Sze. 2023. RAELLA: Reforming the Arithmetic for Efficient, Low-Resolution, and Low-Loss Analog PIM: No Retraining Required!. In Proceedings of the 50th Annual International Symposium on Computer Architecture (Orlando, FL, USA) (ISCA '23). Association for Computing Machinery, New York, NY, USA, Article 27, 16 pages...
- [4] Tanner Andrulis, Joel S. Emer, and Vivienne Sze. 2024. CiMLoop: A Flexible, Accurate, and Fast Compute-In-Memory Modeling Tool. In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 10–23. https://doi.org/10.1109/ISPASS61541.2024.00012
- [5] Rajeev Balasubramonian, Andrew B. Kahng, Naveen Muralimanohar, Ali Shafiee, and Vaishnav Srinivas. 2017. CACTI 7: New Tools for Interconnect Exploration in Innovative Off-Chip Memories. ACM Trans. Archit. Code Optim. 14, 2, Article 14 (June 2017), 25 pages. https://doi.org/10.1145/3085572
- [6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott ...
- [7] Jingwei Cai, Yuchen Wei, Zuotong Wu, Sen Peng, and Kaisheng Ma. 2023. Inter-Layer Scheduling Space Definition and Exploration for Tiled Accelerators. In Proceedings of the 50th Annual International Symposium on Computer Architecture (Orlando, FL, USA) (ISCA '23). Association for Computing Machinery, New York, NY, USA, Article 13, 17 pages. https://doi.org...
- [8] Xuyi Cai, Ying Wang, and Lei Zhang. 2022. Optimus: An Operator Fusion Framework for Deep Neural Networks. ACM Trans. Embed. Comput. Syst. 22, 1, Article 1 (Oct. 2022), 26 pages. https://doi.org/10.1145/3520142
- [9] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsba... https://www.usenix.org/conference/osdi18/presentation/chen
- [11] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 367–379. https://doi.org/10.1109/ISCA.2016.40
- [12] Tri Dao. 2024. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=mZn2Xyh9Ec
- [13] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv e-prints, Article arXiv:2205.14135 (May 2022). https://doi.org/10.48550/arXiv.2205.14135 arXiv:2205.14135 [cs.LG]
- [14] A. Einstein. 1916. The Foundation of the General Theory of Relativity. Annalen der Physik 354, 7 (1916), 769–822. https://doi.org/10.1002/andp.19163540702
- [15] Michael Gilbert. 2023. LoopTree: Enabling Systematic and Flexible Exploration of Fused-layer Dataflow Accelerators. PhD thesis. Massachusetts Institute of Technology, Cambridge, MA.
- [16] Michael Gilbert, Tanner Andrulis, Vivienne Sze, and Joel S. Emer. 2026. The Turbo-Charged Mapper: Fast and Optimal Mapping for Accelerator Modeling and Evaluation. arXiv:2602.15172 [cs.AR] https://arxiv.org/abs/2602.15172
- [17] Michael Gilbert, Yannan Nellie Wu, Joel S. Emer, and Vivienne Sze. 2024. LoopTree: Exploring the Fused-Layer Dataflow Accelerator Design Space. IEEE Transactions on Circuits and Systems for Artificial Intelligence 1, 1 (2024), 97–... https://doi.org/10.1109/TCASAI.2024.3461716
- [20] Google. 2022. XLA: Optimizing Compiler for Machine Learning. https://www.tensorflow.org/xla
- [21] Kartik Hegde, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, and Christopher W. Fletcher. 2019. ExTensor: An Accelerator for Sparse Tensor Algebra. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (Columbus, OH, USA) (MICRO '52). Association for Computing Machinery, New ...
- [22] Kartik Hegde, Po-An Tsai, Sitao Huang, Vikas Chandra, Angshuman Parashar, and Christopher W. Fletcher. 2021. Mind Mappings: Enabling Efficient Algorithm-Accelerator Mapping Space Search. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '21). IEEE
- [23] Mark Horeni, Pooria Taheri, Po-An Tsai, Angshuman Parashar, Joel Emer, and Siddharth Joshi. 2022. Ruby: Improving Hardware Efficiency for Tensor Algebra Accelerators Through Imperfect Factorization. In 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 254–266. https://doi.org/10.1109/ISPASS55109.2022.00039
- [24] Qijing Huang, Minwoo Kang, Grace Dinh, Thomas Norell, Aravind Kalaiah, James Demmel, John Wawrzynek, and Yakun Sophia Shao. 2021. CoSA: Scheduling by Constrained Optimization for Spatial Accelerators. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). 554–566. https://doi.org/10.1109/ISCA52012.2021.00050
- [25] Hongyang Jia, Hossein Valavi, Yinqi Tang, Jintao Zhang, and Naveen Verma. 2020. A Programmable Heterogeneous Microprocessor Based on Bit-Scalable In-Memory Computing. IEEE Journal of Solid-State Circuits 55, 9 (2020), 2609–2621. https://doi.org/10.1109/JSSC.2020.2987714
- [27] Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, Thomas Norrie, Nishant Patil, Sushma Prasad, Cliff Young, Zongwei Zhou, and David Patterson. 2021. Ten Lessons From Three Generations Shaped Google's TPUv4i: Industrial Product. In 2021 ACM/IEEE 48th Annual Intern...
- [28] Sheng-Chun Kao and Tushar Krishna. 2020. GAMMA: Automating the HW Mapping of DNN Models on Accelerators via Genetic Algorithm. In 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD). 1–9.
- [29] Sheng-Chun Kao, Suvinay Subramanian, Gaurav Agrawal, Amir Yazdanbakhsh, and Tushar Krishna. 2021. FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks. https://doi.org/10.48550/ARXIV.2107.06419
- [30] Hyunjoon Kim, Taegeun Yoo, Tony Tae-Hyoung Kim, and Bongjin Kim. 2021. Colonnade: A Reconfigurable SRAM-Based Digital Bit-Serial Compute-In-Memory Macro for Processing Neural Networks. IEEE Journal of Solid-State Circuits 56, 7 (2021), 2221–2233. https://doi.org/10.1109/JSSC.2021.3061508
- [31] Fredrik Kjolstad, Stephen Chou, David Lugato, Shoaib Kamil, and Saman Amarasinghe. 2017. Taco: A tool to generate tensor algebra kernels. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). 943–948. https://doi.org/10.1109/ASE.2017.8115709
- [32] Hyoukjun Kwon, Prasanth Chatarasi, Vivek Sarkar, Tushar Krishna, Michael Pellauer, and Angshuman Parashar. 2020. MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings. IEEE Micro 40, 3 (2020), 20–29. https://doi.org/10.1109/MM.2020.2985963
- [33] Marco Laumanns, Lothar Thiele, Kalyanmoy Deb, and Eckart Zitzler. 2002. Combining Convergence and Diversity in Evolutionary Multiobjective Optimization. Evolutionary Computation 10, 3 (2002), 263–282. https://doi.org/10.1162/106365602760234108
- [34] L. Mei, K. Goetschalckx, A. Symons, and M. Verhelst. 2023. DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators through Analytical Modeling. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE Computer Society, Los Alamitos, CA, USA, 570–583. https://doi.org/10.1109/HPCA56546.20...
- [35] Linyan Mei, Pouya Houshmand, Vikram Jain, Sebastian Giraldo, and Marian Verhelst. 2021. ZigZag: Enlarging Joint Architecture-Mapping Design Space Exploration for DNN Accelerators. IEEE Trans. Comput. 70, 8 (2021), 1160–1174. https://doi.org/10.1109/TC.2021.3059962
- [36] Microsoft. 2025. Develop AI applications for Copilot+ PCs. https://learn.microsoft.com/en-us/windows/ai/npu-devices/. Accessed: 2026-04-06.
- [37] Nandeeka Nayak, Toluwanimi O. Odemuyiwa, Shubham Ugare, Christopher W. Fletcher, Michael Pellauer, and Joel S. Emer. 2023. TeAAL: A Declarative Framework for Modeling Sparse Tensor Accelerators. arXiv e-prints, Article arXiv:2304.07931 (April 2023). https://doi.org/10.48550/arXiv.2304.07931 arXiv:2304.07931 [cs.AR]
- [38] Nandeeka Nayak, Xinrui Wu, Toluwanimi O. Odemuyiwa, Michael Pellauer, Joel S. Emer, and Christopher W. Fletcher. 2024. FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design. In Proceedings of the 57th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '24). Association for Computing Machinery, New York, NY, USA.
- [39] Toluwanimi O. Odemuyiwa, Joel S. Emer, and John D. Owens. 2024. The EDGE Language: Extended General Einsums for Graph Algorithms. arXiv e-prints, Article arXiv:2404.11591 (April 2024). https://doi.org/10.48550/arXiv.2404.11591 arXiv:2404.11591 [cs.DS]
- [40] Angshuman Parashar, Priyanka Raina, Yakun Sophia Shao, Yu-Hsin Chen, Victor A. Ying, Anurag Mukkara, Rangharajan Venkatesan, Brucek Khailany, Stephen W. Keckler, and Joel Emer. 2019. Timeloop: A Systematic Approach to DNN Accelerator Evaluation. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 304–315. https:...
- [41] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 14–26. https://doi.org/10.1...
- [42] Kyle Shiflett, Avinash Karanth, Razvan Bunescu, and Ahmed Louri. 2021. Albireo: energy-efficient acceleration of convolutional neural networks via silicon photonics. In Proceedings of the 48th Annual International Symposium on Computer Architecture (Virtual Event, Spain) (ISCA '21). IEEE Press, 860–873. https://doi.org/10.1109/ISCA52012.2021.00072
- [43] Mahmut E. Sinangil, Burak Erbagci, Rawan Naous, Kerem Akarvardar, Dar Sun, Win-San Khwa, Hung-Jen Liao, Yih Wang, and Jonathan Chang. 2021. A 7-nm Compute-in-Memory SRAM Macro Supporting Multi-Bit Input, Weight and Output and Achieving 351 TOPS/W and 372.4 GOPS. IEEE Journal of Solid-State Circuits 56, 1 (2021), 188–198. https://doi.org/10.1109/JSSC.2020.3031290
- [44] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2020. Efficient Processing of Deep Neural Networks. Synthesis Lectures on Computer Architecture 15, 2 (2020), 1–341. https://doi.org/10.2200/S01004ED1V01Y202004CAC050
- [45] Luc Waeijen, Savvas Sioutas, Maurice Peemen, Menno Lindwer, and Henk Corporaal. 2021. ConvFusion: A Model for Layer Fusion in Convolutional Neural Networks. IEEE Access 9 (2021), 168245–168267. https://doi.org/10.1109/ACCESS.2021.3134930
- [46] Weier Wan, Rajkumar Kubendran, S. Burc Eryilmaz, Wenqiang Zhang, Yan Liao, Dabin Wu, Stephen Deiss, Bin Gao, Priyanka Raina, Siddharth Joshi, Huaqiang Wu, Gert Cauwenberghs, and H.-S. Philip Wong. 2020. 33.1 A 74 TMACS/W CMOS-RRAM Neurosynaptic Core with Dynamically Reconfigurable Dataflow and In-situ Transposable Weights for Probabilistic Graphical Model...
- [47] Weier Wan, Rajkumar Kubendran, Clemens Schaefer, Sukru Burc Eryilmaz, Wenqiang Zhang, Dabin Wu, Stephen Deiss, Priyanka Raina, He Qian, Bin Gao, Siddharth Joshi, Huaqiang Wu, H.-S. Philip Wong, and Gert Cauwenberghs. 2022. A compute-in-memory chip based on resistive random-access memory. Nature 608, 7923 (Aug. 2022), 504–512. https://doi.org/10.1038/s415...
- [48] Hechen Wang, Renzhi Liu, Richard Dorrance, Deepak Dasalukunte, Dan Lake, and Brent Carlton. 2023. A Charge Domain SRAM Compute-in-Memory Macro With C-2C Ladder-Based 8-Bit MAC Unit in 22-nm FinFET Process for Edge Inference. IEEE Journal of Solid-State Circuits 58, 4 (2023), 1037–1050. https://doi.org/10.1109/JSSC.2022.3232601
- [49] Hechen Wang, Renzhi Liu, Richard Dorrance, Deepak Dasalukunte, Xiaosen Liu, Dan Lake, Brent Carlton, and May Wu. 2022. A 32.2 TOPS/W SRAM Compute-in-Memory Macro Employing a Linear 8-bit C-2C Ladder for Charge Domain Computation in 22nm for Edge Inference. In 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits). 36–37. https:...
- [50] Yuchen Wei, Jingwei Cai, Zuotong Wu, Sen Peng, and Kaisheng Ma. 2023. SET Artifacts. https://doi.org/10.5281/zenodo.7751328
- [51] Yannan Nellie Wu, Joel S. Emer, and Vivienne Sze. 2019. Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs. In 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 1–8. https://doi.org/10.1109/ICCAD45719.2019.8942149
- [52] Linxuan Zhang, J. Nelson Amaral, and Di Niu. 2025. TransFusion: End-to-End Transformer Acceleration via Graph Fusion and Pipelining. Association for Computing Machinery, New York, NY, USA, 1491–1504. https://doi.org/10.1145/3725843.3756105
- [53] Size Zheng, Siyuan Chen, Siyuan Gao, Liancheng Jia, Guangyu Sun, Runsheng Wang, and Yun Liang. 2023. TileFlow: A Framework for Modeling Fusion Dataflow via Tree-Based Analysis. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (Toronto, ON, Canada)...
- [54] Zhizhen Zhong, Mingran Yang, Jay Lang, Christian Williams, Liam Kronman, Alexander Sludds, Homa Esfahanizadeh, Dirk Englund, and Manya Ghobadi. 2023. Lightning: A Reconfigurable Photonic-Electronic SmartNIC for Fast and Energy-Efficient Inference. In Proceedings of the ACM SIGCOMM 2023 Conference (New York, NY, USA) (ACM SIGCOMM '23). Association for Comput...