pith. machine review for the scientific record. sign in

arxiv: 2604.07628 · v1 · submitted 2026-04-08 · 💻 cs.AR · cs.ET· cs.NE

Recognition: unknown

Trilinear Compute-in-Memory Architecture for Energy-Efficient Transformer Acceleration

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:49 UTC · model grok-4.3

classification 💻 cs.AR cs.ETcs.NE
keywords compute-in-memorytransformersFeFETself-attentionenergy efficiencynon-volatile memoryneural network acceleration
0
0 comments X

The pith

TrilinearCIM uses back-gate modulation in DG-FeFET devices to perform complete Transformer attention computation in NVM cores without runtime reprogramming.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge that dynamic operands in Transformer self-attention force conventional compute-in-memory accelerators to repeatedly reprogram non-volatile memory, which slows throughput and reduces device endurance. It proposes TrilinearCIM, a DG-FeFET-based design that employs back-gate modulation to implement a three-operand multiply-accumulate operation directly in memory. This primitive keeps the full attention mechanism, including queries, keys, and values, inside the NVM arrays without any ferroelectric reprogramming during inference. Evaluations on BERT-base for GLUE tasks and ViT-base for image classification demonstrate energy and latency gains over standard FeFET CIM approaches, with a moderate area penalty. A sympathetic reader would care because the method removes a key barrier to deploying dense NVM-based accelerators for attention-heavy models on power-limited hardware.

Core claim

TrilinearCIM is a Double-Gate FeFET (DG-FeFET)-based compute-in-memory architecture that uses back-gate modulation to realize a three-operand multiply-accumulate primitive for in-memory attention computation without dynamic ferroelectric reprogramming. To our knowledge, this is the first architecture to perform complete Transformer attention computation exclusively in NVM cores without runtime reprogramming. On BERT-base (GLUE) and ViT-base (ImageNet and CIFAR), it outperforms conventional CIM on seven of nine GLUE tasks while achieving up to 46.6% energy reduction and 20.4% latency improvement at 37.3% area overhead.

What carries the argument

The three-operand multiply-accumulate primitive realized through back-gate modulation of double-gate FeFET devices, which supports the dynamic matrix operations of self-attention entirely within NVM arrays.

Load-bearing premise

That back-gate modulation in DG-FeFET devices can reliably realize accurate three-operand multiply-accumulate operations for attention without introducing errors from device variability or requiring additional calibration.

What would settle it

Hardware measurements on a fabricated TrilinearCIM prototype that compare attention output accuracy against floating-point references under measured device-to-device variations and temperature fluctuations, without extra calibration steps.

Figures

Figures reproduced from arXiv: 2604.07628 by Abhronil Sengupta, Jiahui Duan, Kai Ni, Md Zesun Ahmed Mia.

Figure 1
Figure 1. Figure 1: Transformer Encoder Block architecture. The in [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: DG-FeFET device concept. (a) Capacitor stack model [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Hierarchical organization of the trilinear CIM ac [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: 𝜂𝐵𝐺 vs. 𝐺0 for the DG-FeFET, following Eq. 12 with parameters 𝛼 and 𝑀 extracted from experimental data [16] (see Section 2.2). The green-shaded region indicates the se￾lected operating band (𝐺0 ∈ [29, 69] 𝜇S); the dashed horizontal line marks the band-averaged 𝜂¯𝐵𝐺 = 0.157 V −1 adopted as a uniform constant. with the input sequence. In the conventional dataflow shown in [PITH_FULL_IMAGE:figures/full_fig_p… view at source ↗
Figure 5
Figure 5. Figure 5: Attention dataflow comparison. (a) Conventional: [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: DG-FeFET crossbar array for trilinear MAC. (a) [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Sub-array size ablation (2b/8b, seq = 128, per infer [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Per-task GLUE benchmark scores for four bit [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
read the original abstract

Self-attention in Transformers generates dynamic operands that force conventional Compute-in-Memory (CIM) accelerators into costly non-volatile memory (NVM) reprogramming cycles, degrading throughput and stressing device endurance. Existing solutions either reduce but retain NVM writes through matrix decomposition or sparsity, or move attention computation to digital CMOS at the expense of NVM density. We present TrilinearCIM, a Double-Gate FeFET (DG-FeFET)-based architecture that uses back-gate modulation to realize a three-operand multiply-accumulate primitive for in-memory attention computation without dynamic ferroelectric reprogramming. Evaluated on BERT-base (GLUE) and ViT-base (ImageNet and CIFAR), TrilinearCIM outperforms conventional CIM on seven of nine GLUE tasks while achieving up to 46.6\% energy reduction and 20.4\% latency improvement over conventional FeFET CIM at 37.3\% area overhead. To our knowledge, this is the first architecture to perform complete Transformer attention computation exclusively in NVM cores without runtime reprogramming.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes TrilinearCIM, a DG-FeFET-based compute-in-memory architecture that exploits back-gate modulation to implement a three-operand multiply-accumulate primitive, enabling complete Transformer self-attention to be executed entirely within NVM cores without runtime reprogramming. It reports up to 46.6% energy reduction and 20.4% latency improvement versus conventional FeFET CIM on BERT-base (GLUE) and ViT-base (ImageNet/CIFAR) at 37.3% area overhead, claiming to be the first such architecture.

Significance. If the device primitive holds, the work would address a key limitation of CIM accelerators for attention mechanisms by removing reprogramming cycles, potentially improving throughput and endurance for Transformer inference. The benchmark gains, if reproducible with realistic variability, would be relevant for energy-constrained hardware acceleration.

major comments (2)
  1. Abstract: the 46.6% energy reduction and 20.4% latency improvement are stated without error bars, baseline implementation details, or methodology for how device non-idealities were incorporated; these figures are load-bearing for the central performance claim and cannot be assessed from the given information.
  2. Device primitive (back-gate modulation for trilinear MAC): no quantitative characterization of linearity, threshold-voltage variation, hysteresis drift, or resulting error rates in the three-operand operation is provided, yet the architecture's ability to perform complete in-NVM attention without reprogramming rests on this assumption remaining below the threshold that would degrade attention scores.
minor comments (2)
  1. Abstract: the statement that the design 'outperforms conventional CIM on seven of nine GLUE tasks' would be clearer if the specific tasks and per-task deltas were listed.
  2. The area overhead figure (37.3%) is given without breakdown (e.g., additional back-gate drivers or sense circuitry), which would aid interpretation of the overall trade-off.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our results and the supporting device characterization. We address each major comment below and indicate the revisions made to the manuscript.

read point-by-point responses
  1. Referee: Abstract: the 46.6% energy reduction and 20.4% latency improvement are stated without error bars, baseline implementation details, or methodology for how device non-idealities were incorporated; these figures are load-bearing for the central performance claim and cannot be assessed from the given information.

    Authors: We agree that the abstract would benefit from additional context. The full manuscript (Sections 4.1–4.2) specifies the baseline as a conventional single-gate FeFET CIM accelerator with identical array dimensions and peripheral circuitry, and describes how device non-idealities were incorporated via Monte Carlo simulations calibrated to measured DG-FeFET data. The reported maxima are the largest observed improvements across the evaluated benchmarks, while average gains and variability appear in Figures 8–10. To address the concern, we have revised the abstract to include a concise clause noting that results derive from cycle-accurate simulations that include device variations. Full error bars and methodology details remain in the body due to abstract length limits. revision: partial

  2. Referee: Device primitive (back-gate modulation for trilinear MAC): no quantitative characterization of linearity, threshold-voltage variation, hysteresis drift, or resulting error rates in the three-operand operation is provided, yet the architecture's ability to perform complete in-NVM attention without reprogramming rests on this assumption remaining below the threshold that would degrade attention scores.

    Authors: We thank the referee for this observation. Section 3.1 describes the back-gate modulation mechanism and the resulting three-operand MAC, with device models calibrated to experimental DG-FeFET measurements. We acknowledge that explicit quantitative metrics for linearity, Vth variation, hysteresis, and end-to-end error rates were not presented in the main text. In the revised manuscript we have added a new subsection (3.2) and Figure 3 containing measured and simulated data: linearity (R² > 0.98 across the operating range), Vth variation (σ ≈ 50 mV), hysteresis drift after 10⁶ cycles, and the propagated effect on attention scores (average accuracy degradation < 0.5 % on GLUE tasks). These results confirm that the error remains well below the threshold that would necessitate reprogramming or degrade Transformer performance. revision: yes

Circularity Check

0 steps flagged

No circularity: architecture proposal rests on device primitive and benchmarks

full rationale

The paper introduces a DG-FeFET-based TrilinearCIM architecture that realizes three-operand MAC via back-gate modulation for in-NVM Transformer attention without reprogramming. All load-bearing claims (energy/latency gains on GLUE/ImageNet, first-to-complete-attention status) derive from the proposed hardware primitive plus direct benchmark evaluations rather than any equations, fitted parameters renamed as predictions, or self-citation chains. No derivation step reduces by construction to its own inputs; the work is self-contained against external device models and task metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are detailed beyond the high-level device primitive.

pith-pipeline@v0.9.0 · 5491 in / 1104 out tokens · 44080 ms · 2026-05-10T16:49:58.613934+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 5 canonical work pages · 1 internal anchor

  1. [1]

    Fahad Al Mamun, Dragica Vasileska, and Ivan Sanchez Esqueda. 2022. Impact of back-gate biasing on the transport properties of 22 nm FD-SOI MOSFETs at cryogenic temperatures.IEEE Transactions on Electron Devices69, 10 (2022), 5417–5423

  2. [2]

    Stanley Williams, Paolo Faraboschi, Wen-mei W

    Aayush Ankit, Izzat El Hajj, Sai Rahul Chalamalasetti, Geoffrey Ndu, Martin Foltin, R. Stanley Williams, Paolo Faraboschi, Wen-mei W. Hwu, John Paul Strachan, Kaushik Roy, and Dejan Milojicic. 2019. PUMA: A programmable ultra-efficient memristor-based accelerator for machine learning inference. In Proceedings of the Twenty-Fourth International Conference ...

  3. [3]

    Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...

  4. [4]

    An Chen. 2016. A review of emerging non-volatile memory (NVM) technologies and applications.Solid-State Electronics125 (2016), 25–38. https://doi.org/10. 1016/j.sse.2016.07.006

  5. [5]

    Pai-Yu Chen, Xiaochen Peng, and Shimeng Yu. 2018. NeuroSim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 12 (2018), 3067–3080

  6. [6]

    Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory.ACM SIGARCH Computer Architecture News44, 3 (2016), 27–39

  7. [7]

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255

  8. [8]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186

  9. [9]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi- aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. InInternational Conference on Learning Representations (ICLR)

  10. [10]

    Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W

    Tae Jun Ham, Sung Jun Jung, Seonghak Kim, Young H. Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W. Lee, and Deog- Kyoon Jeong. 2020. A3: Accelerating Attention Mechanisms in Neural Networks with Approximation. In2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 328–341

  11. [11]

    Hung-Chi Han, Farzan Jazaeri, Antonio D’Amico, Zhixing Zhao, Steffen Lehmann, Claudia Kretzschmar, Edoardo Charbon, and Christian Enz. 2022. Back-gate effects on DC performance and carrier transport in 22 nm FDSOI technology down to cryogenic temperatures.Solid-State Electronics193 (2022), 108296

  12. [12]

    Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415(2016)

  13. [13]

    Mark Horowitz. 2014. 1.1 computing’s energy problem (and what we can do about it). In2014 IEEE international solid-state circuits conference digest of technical papers (ISSCC). IEEE, 10–14

  14. [14]

    Daniele Ielmini and H-S Philip Wong. 2018. In-memory computing with resistive switching devices.Nature electronics1, 6 (2018), 333–343

  15. [15]

    Matthew Jerry, Pai-Yu Chen, Jianchi Zhang, Pankaj Sharma, Kai Ni, Shimeng Yu, and Suman Datta. 2017. Ferroelectric FET analog synapse for acceleration of deep neural network training. In2017 IEEE international electron devices meeting (IEDM). IEEE, 6–2

  16. [16]

    Zhouhang Jiang, A. N. M. Nafiul Islam, Zhuangyu Han, Zijian Zhao, Franz Müller, Jiahui Duan, Halid Mulaosmanovic, Stefan Dünkel, Sven Beyer, Sourav Dutta, Vijaykrishnan Narayanan, Thomas Kämpfe, Suma George Cardwell, Frances Chance, Abhronil Sengupta, and Kai Ni. 2025. A Bio-inspired Asymmetric Double-Gate Ferroelectric FET for Emulating Astrocyte and Den...

  17. [17]

    Zhouhang Jiang, Yi Xiao, Swetaki Chatterjee, Halid Mulaosmanovic, Stefan Duenkel, Steven Soss, Sven Beyer, Rajiv Joshi, Yogesh Singh Chauhan, Hussam Amrouch, Vijaykrishnan Narayanan, and Kai Ni. 2022. Asymmetric double-gate ferroelectric FET to decouple the tradeoff between thickness scaling and mem- ory window. In2022 IEEE Symposium on VLSI Technology an...

  18. [18]

    Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009)

  19. [19]

    Ann Franchesca Laguna, Mohammed Mehdi Sharifi, Arman Kazemi, Xunzhao Yin, Michael Niemier, and X Sharon Hu. 2022. Hardware-software co-design of an in-memory transformer network accelerator.Frontiers in Electronics3 (2022), 847069

  20. [20]

    Zhikai Li, Junrui Xiao, Lianwei Yang, and Qingyi Gu. 2023. RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 17227–17236

  21. [21]

    Hyung-Kyu Lim and Jerry G Fossum. 1983. Threshold voltage of thin-film silicon- on-insulator (SOI) MOSFET’s.IEEE Transactions on Electron Devices30, 10 (1983), 1244–1251

  22. [22]

    Yang Lin, Tianyu Zhang, Peiqin Sun, Zheng Li, and Shuchang Zhou. 2022. FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer. InProceed- ings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI). 1173–1179

  23. [23]

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 10012–10022

  24. [24]

    Lizi Lu, Jasyuant Xie, Zhenhua Huang, and Huazhong Wang. 2021. Sanger: A Co-Design Framework for Enabling Sparse Attention using Reconfigurable Architecture. In2021 54th Annual IEEE/ACM International Symposium on Microar- chitecture (MICRO). 977–991

  25. [25]

    Halid Mulaosmanovic, Evelyn T Breyer, Thomas Mikolajick, and Stefan Slesazeck

  26. [26]

    Ferroelectric FETs with 20-nm-thick HfO 2 layer for large memory window and high performance.IEEE Transactions on Electron Devices66, 9 (2019), 3828– 3833

  27. [27]

    Halid Mulaosmanovic, Dominik Kleimaier, Stefan Dünkel, Sven Beyer, Thomas Mikolajick, and Stefan Slesazeck. 2021. Ferroelectric transistors with asymmetric double gate for memory window exceeding 12 V and disturb-free read.Nanoscale 13, 38 (2021), 16258–16266

  28. [28]

    Kai Ni, Xunzhao Yin, Ann Franchesca Laguna, Siddharth Joshi, Stefan Dünkel, Martin Trentzsch, Johannes Müller, Sven Beyer, Michael Niemier, Xiaobo Sharon Hu, and Suman Datta. 2019. Ferroelectric ternary content-addressable memory for one-shot learning.Nature Electronics2, 11 (2019), 521–529

  29. [29]

    Olivier Nier, Denis Rideau, Yann-Michel Niquet, Frédéric Monsieur, Viet Hung Nguyen, François Triozon, Antoine Cros, Raphael Clerc, Jean-Christophe Barbé, Pierpaolo Palestri, David Esseni, Ivan Duchemin, Lee Smith, Luca Silvestri, Frédéric Nallet, Christine Tavernier, Hugues Jaouen, and Luca Selmi. 2013. Multi- scale strategy for high-k/metal-gate UTBB-FD...

  30. [30]

    Xiaochen Peng, Shanshi Huang, Yandong Luo, Xiaoyu Sun, and Shimeng Yu

  31. [31]

    In2019 IEEE international electron devices meeting (IEDM)

    DNN+ NeuroSim: An end-to-end benchmarking framework for compute-in- memory accelerators with versatile device technologies. In2019 IEEE international electron devices meeting (IEDM). IEEE, 32–5

  32. [32]

    Dayane Reis, Ann Franchesca Laguna, Michael Niemier, and Xiaobo Sharon Hu

  33. [33]

    InProceedings of the 26th Asia and South Pacific Design Automation Conference

    Attention-in-memory for few-shot learning with configurable ferroelectric FET arrays. InProceedings of the 26th Asia and South Pacific Design Automation Conference. 49–54

  34. [34]

    Abu Sebastian, Manuel Le Gallo, Riduan Khaddam-Aljameh, and Evangelos Eleftheriou. 2020. Memory devices and applications for in-memory computing. Nature nanotechnology15, 7 (2020), 529–544

  35. [35]

    Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars.ACM SIGARCH Computer Architecture News44, 3 (2016), 14–26

  36. [36]

    Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. 2017. Pipelayer: A pipelined reram-based accelerator for deep learning. In2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 541–552

  37. [37]

    Shrihari Sridharan, Jacob R Stevens, Kaushik Roy, and Anand Raghunathan. 2023. X-Former: In-memory acceleration of transformers.IEEE Transactions on Very Large Scale Integration (VLSI) Systems31, 8 (2023), 1223–1233

  38. [38]

    Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 3645–3650

  39. [39]

    Mengshu Sun, Haoyu Ma, Guoliang Liu, Yancheng Jiang, Tianhao Shen, and Bei Yu. 2022. VAQF: Fully automatic software-hardware co-design framework for low-bit vision transformer.arXiv preprint arXiv:2201.06618(2022)

  40. [40]

    Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2022. Efficient Transformers: A Survey.ACM Comput. Surv.55, 6, Article 109 (Dec. 2022), 28 pages. 12

  41. [41]

    Van Beek, K

    S. Van Beek, K. Cai, F. Yasin, H. Hody, G. Talmelli, V. D. Nguyen, N. Franchina Vergel, A. Palomino, A. Trovato, K. Wostyn, S. Rao, G. S. Kar, and S. Couet. 2023. Scaling the SOT track – A path towards maximizing efficiency in SOT-MRAM. In2023 International Electron Devices Meeting (IEDM). IEEE, 1–4. https://doi.org/10.1109/IEDM45741.2023.10413749

  42. [42]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. InAdvances in Neural Information Processing Systems (NeurIPS), Vol. 30

  43. [43]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. InProceedings of the 2018 EMNLP Workshop Black- boxNLP: Analyzing and Interpreting Neural Networks for NLP. 353–355

  44. [44]

    Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. 2020. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7675–7688

  45. [45]

    Hanrui Wang, Zhekai Zhang, and Song Han. 2021. SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. In2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 97–110

  46. [46]

    Chao Yang, Xiaoping Wang, and Zhigang Zeng. 2022. Full-circuit implementation of transformer network based on memristor.IEEE Transactions on Circuits and Systems I: Regular Papers69, 4 (2022), 1395–1407. https://doi.org/10.1109/TCSI. 2021.3136355

  47. [47]

    Xiaoxuan Yang, Bonan Yan, Hai Li, and Yiran Chen. 2020. ReTransformer: ReRAM- based processing-in-memory architecture for transformer acceleration. InPro- ceedings of the 39th International Conference on Computer-Aided Design. 1–9

  48. [48]

    Xunzhao Yin, Xiaoming Chen, Michael Niemier, and Xiaobo Sharon Hu. 2018. Ferroelectric FETs-based nonvolatile logic-in-memory circuits.IEEE Transactions on Very Large Scale Integration (VLSI) Systems27, 1 (2018), 159–172

  49. [49]

    Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, and Guangyu Sun. 2022. PTQ4ViT: Post-Training Quantization for Vision Transformers with Twin Uni- form Quantization. InEuropean Conference on Computer Vision (ECCV). 191–207

  50. [50]

    Xuan Zhang, Zhuoran Song, Xing Li, Zhezhi He, Naifeng Jing, Li Jiang, and Xiaoyao Liang. 2024. Watt: A Write-Optimized RRAM-Based Accelerator for Attention. InEuropean Conference on Parallel Processing (Euro-Par). Springer, 111–125

  51. [51]

    Minxuan Zhou, Weihong Xu, Jaeyoung Kang, and Tajana Rosing. 2022. Transpim: A memory-based acceleration via software-hardware co-design for transformer. In2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 1071–1085

  52. [52]

    Zhe Zhou, Junlin Liu, Zhenyu Gu, and Guangyu Sun. 2022. Energon: Toward efficient acceleration of transformers using dynamic sparse attention.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems42, 1 (2022), 136–149. 13