Recognition: unknown
Trilinear Compute-in-Memory Architecture for Energy-Efficient Transformer Acceleration
Pith reviewed 2026-05-10 16:49 UTC · model grok-4.3
The pith
TrilinearCIM uses back-gate modulation in DG-FeFET devices to perform complete Transformer attention computation in NVM cores without runtime reprogramming.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TrilinearCIM is a Double-Gate FeFET (DG-FeFET)-based compute-in-memory architecture that uses back-gate modulation to realize a three-operand multiply-accumulate primitive for in-memory attention computation without dynamic ferroelectric reprogramming. To our knowledge, this is the first architecture to perform complete Transformer attention computation exclusively in NVM cores without runtime reprogramming. On BERT-base (GLUE) and ViT-base (ImageNet and CIFAR), it outperforms conventional CIM on seven of nine GLUE tasks while achieving up to 46.6% energy reduction and 20.4% latency improvement at 37.3% area overhead.
What carries the argument
The three-operand multiply-accumulate primitive realized through back-gate modulation of double-gate FeFET devices, which supports the dynamic matrix operations of self-attention entirely within NVM arrays.
Load-bearing premise
That back-gate modulation in DG-FeFET devices can reliably realize accurate three-operand multiply-accumulate operations for attention without introducing errors from device variability or requiring additional calibration.
What would settle it
Hardware measurements on a fabricated TrilinearCIM prototype that compare attention output accuracy against floating-point references under measured device-to-device variations and temperature fluctuations, without extra calibration steps.
Figures
read the original abstract
Self-attention in Transformers generates dynamic operands that force conventional Compute-in-Memory (CIM) accelerators into costly non-volatile memory (NVM) reprogramming cycles, degrading throughput and stressing device endurance. Existing solutions either reduce but retain NVM writes through matrix decomposition or sparsity, or move attention computation to digital CMOS at the expense of NVM density. We present TrilinearCIM, a Double-Gate FeFET (DG-FeFET)-based architecture that uses back-gate modulation to realize a three-operand multiply-accumulate primitive for in-memory attention computation without dynamic ferroelectric reprogramming. Evaluated on BERT-base (GLUE) and ViT-base (ImageNet and CIFAR), TrilinearCIM outperforms conventional CIM on seven of nine GLUE tasks while achieving up to 46.6\% energy reduction and 20.4\% latency improvement over conventional FeFET CIM at 37.3\% area overhead. To our knowledge, this is the first architecture to perform complete Transformer attention computation exclusively in NVM cores without runtime reprogramming.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TrilinearCIM, a DG-FeFET-based compute-in-memory architecture that exploits back-gate modulation to implement a three-operand multiply-accumulate primitive, enabling complete Transformer self-attention to be executed entirely within NVM cores without runtime reprogramming. It reports up to 46.6% energy reduction and 20.4% latency improvement versus conventional FeFET CIM on BERT-base (GLUE) and ViT-base (ImageNet/CIFAR) at 37.3% area overhead, claiming to be the first such architecture.
Significance. If the device primitive holds, the work would address a key limitation of CIM accelerators for attention mechanisms by removing reprogramming cycles, potentially improving throughput and endurance for Transformer inference. The benchmark gains, if reproducible with realistic variability, would be relevant for energy-constrained hardware acceleration.
major comments (2)
- Abstract: the 46.6% energy reduction and 20.4% latency improvement are stated without error bars, baseline implementation details, or methodology for how device non-idealities were incorporated; these figures are load-bearing for the central performance claim and cannot be assessed from the given information.
- Device primitive (back-gate modulation for trilinear MAC): no quantitative characterization of linearity, threshold-voltage variation, hysteresis drift, or resulting error rates in the three-operand operation is provided, yet the architecture's ability to perform complete in-NVM attention without reprogramming rests on this assumption remaining below the threshold that would degrade attention scores.
minor comments (2)
- Abstract: the statement that the design 'outperforms conventional CIM on seven of nine GLUE tasks' would be clearer if the specific tasks and per-task deltas were listed.
- The area overhead figure (37.3%) is given without breakdown (e.g., additional back-gate drivers or sense circuitry), which would aid interpretation of the overall trade-off.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our results and the supporting device characterization. We address each major comment below and indicate the revisions made to the manuscript.
read point-by-point responses
-
Referee: Abstract: the 46.6% energy reduction and 20.4% latency improvement are stated without error bars, baseline implementation details, or methodology for how device non-idealities were incorporated; these figures are load-bearing for the central performance claim and cannot be assessed from the given information.
Authors: We agree that the abstract would benefit from additional context. The full manuscript (Sections 4.1–4.2) specifies the baseline as a conventional single-gate FeFET CIM accelerator with identical array dimensions and peripheral circuitry, and describes how device non-idealities were incorporated via Monte Carlo simulations calibrated to measured DG-FeFET data. The reported maxima are the largest observed improvements across the evaluated benchmarks, while average gains and variability appear in Figures 8–10. To address the concern, we have revised the abstract to include a concise clause noting that results derive from cycle-accurate simulations that include device variations. Full error bars and methodology details remain in the body due to abstract length limits. revision: partial
-
Referee: Device primitive (back-gate modulation for trilinear MAC): no quantitative characterization of linearity, threshold-voltage variation, hysteresis drift, or resulting error rates in the three-operand operation is provided, yet the architecture's ability to perform complete in-NVM attention without reprogramming rests on this assumption remaining below the threshold that would degrade attention scores.
Authors: We thank the referee for this observation. Section 3.1 describes the back-gate modulation mechanism and the resulting three-operand MAC, with device models calibrated to experimental DG-FeFET measurements. We acknowledge that explicit quantitative metrics for linearity, Vth variation, hysteresis, and end-to-end error rates were not presented in the main text. In the revised manuscript we have added a new subsection (3.2) and Figure 3 containing measured and simulated data: linearity (R² > 0.98 across the operating range), Vth variation (σ ≈ 50 mV), hysteresis drift after 10⁶ cycles, and the propagated effect on attention scores (average accuracy degradation < 0.5 % on GLUE tasks). These results confirm that the error remains well below the threshold that would necessitate reprogramming or degrade Transformer performance. revision: yes
Circularity Check
No circularity: architecture proposal rests on device primitive and benchmarks
full rationale
The paper introduces a DG-FeFET-based TrilinearCIM architecture that realizes three-operand MAC via back-gate modulation for in-NVM Transformer attention without reprogramming. All load-bearing claims (energy/latency gains on GLUE/ImageNet, first-to-complete-attention status) derive from the proposed hardware primitive plus direct benchmark evaluations rather than any equations, fitted parameters renamed as predictions, or self-citation chains. No derivation step reduces by construction to its own inputs; the work is self-contained against external device models and task metrics.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Fahad Al Mamun, Dragica Vasileska, and Ivan Sanchez Esqueda. 2022. Impact of back-gate biasing on the transport properties of 22 nm FD-SOI MOSFETs at cryogenic temperatures.IEEE Transactions on Electron Devices69, 10 (2022), 5417–5423
2022
-
[2]
Stanley Williams, Paolo Faraboschi, Wen-mei W
Aayush Ankit, Izzat El Hajj, Sai Rahul Chalamalasetti, Geoffrey Ndu, Martin Foltin, R. Stanley Williams, Paolo Faraboschi, Wen-mei W. Hwu, John Paul Strachan, Kaushik Roy, and Dejan Milojicic. 2019. PUMA: A programmable ultra-efficient memristor-based accelerator for machine learning inference. In Proceedings of the Twenty-Fourth International Conference ...
2019
-
[3]
Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...
2020
-
[4]
An Chen. 2016. A review of emerging non-volatile memory (NVM) technologies and applications.Solid-State Electronics125 (2016), 25–38. https://doi.org/10. 1016/j.sse.2016.07.006
2016
-
[5]
Pai-Yu Chen, Xiaochen Peng, and Shimeng Yu. 2018. NeuroSim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 12 (2018), 3067–3080
2018
-
[6]
Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory.ACM SIGARCH Computer Architecture News44, 3 (2016), 27–39
2016
-
[7]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255
2009
-
[8]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186
2019
-
[9]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi- aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. InInternational Conference on Learning Representations (ICLR)
2021
-
[10]
Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W
Tae Jun Ham, Sung Jun Jung, Seonghak Kim, Young H. Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W. Lee, and Deog- Kyoon Jeong. 2020. A3: Accelerating Attention Mechanisms in Neural Networks with Approximation. In2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 328–341
2020
-
[11]
Hung-Chi Han, Farzan Jazaeri, Antonio D’Amico, Zhixing Zhao, Steffen Lehmann, Claudia Kretzschmar, Edoardo Charbon, and Christian Enz. 2022. Back-gate effects on DC performance and carrier transport in 22 nm FDSOI technology down to cryogenic temperatures.Solid-State Electronics193 (2022), 108296
2022
-
[12]
Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415(2016)
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[13]
Mark Horowitz. 2014. 1.1 computing’s energy problem (and what we can do about it). In2014 IEEE international solid-state circuits conference digest of technical papers (ISSCC). IEEE, 10–14
2014
-
[14]
Daniele Ielmini and H-S Philip Wong. 2018. In-memory computing with resistive switching devices.Nature electronics1, 6 (2018), 333–343
2018
-
[15]
Matthew Jerry, Pai-Yu Chen, Jianchi Zhang, Pankaj Sharma, Kai Ni, Shimeng Yu, and Suman Datta. 2017. Ferroelectric FET analog synapse for acceleration of deep neural network training. In2017 IEEE international electron devices meeting (IEDM). IEEE, 6–2
2017
-
[16]
Zhouhang Jiang, A. N. M. Nafiul Islam, Zhuangyu Han, Zijian Zhao, Franz Müller, Jiahui Duan, Halid Mulaosmanovic, Stefan Dünkel, Sven Beyer, Sourav Dutta, Vijaykrishnan Narayanan, Thomas Kämpfe, Suma George Cardwell, Frances Chance, Abhronil Sengupta, and Kai Ni. 2025. A Bio-inspired Asymmetric Double-Gate Ferroelectric FET for Emulating Astrocyte and Den...
-
[17]
Zhouhang Jiang, Yi Xiao, Swetaki Chatterjee, Halid Mulaosmanovic, Stefan Duenkel, Steven Soss, Sven Beyer, Rajiv Joshi, Yogesh Singh Chauhan, Hussam Amrouch, Vijaykrishnan Narayanan, and Kai Ni. 2022. Asymmetric double-gate ferroelectric FET to decouple the tradeoff between thickness scaling and mem- ory window. In2022 IEEE Symposium on VLSI Technology an...
2022
-
[18]
Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009)
2009
-
[19]
Ann Franchesca Laguna, Mohammed Mehdi Sharifi, Arman Kazemi, Xunzhao Yin, Michael Niemier, and X Sharon Hu. 2022. Hardware-software co-design of an in-memory transformer network accelerator.Frontiers in Electronics3 (2022), 847069
2022
-
[20]
Zhikai Li, Junrui Xiao, Lianwei Yang, and Qingyi Gu. 2023. RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 17227–17236
2023
-
[21]
Hyung-Kyu Lim and Jerry G Fossum. 1983. Threshold voltage of thin-film silicon- on-insulator (SOI) MOSFET’s.IEEE Transactions on Electron Devices30, 10 (1983), 1244–1251
1983
-
[22]
Yang Lin, Tianyu Zhang, Peiqin Sun, Zheng Li, and Shuchang Zhou. 2022. FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer. InProceed- ings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI). 1173–1179
2022
-
[23]
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 10012–10022
2021
-
[24]
Lizi Lu, Jasyuant Xie, Zhenhua Huang, and Huazhong Wang. 2021. Sanger: A Co-Design Framework for Enabling Sparse Attention using Reconfigurable Architecture. In2021 54th Annual IEEE/ACM International Symposium on Microar- chitecture (MICRO). 977–991
2021
-
[25]
Halid Mulaosmanovic, Evelyn T Breyer, Thomas Mikolajick, and Stefan Slesazeck
-
[26]
Ferroelectric FETs with 20-nm-thick HfO 2 layer for large memory window and high performance.IEEE Transactions on Electron Devices66, 9 (2019), 3828– 3833
2019
-
[27]
Halid Mulaosmanovic, Dominik Kleimaier, Stefan Dünkel, Sven Beyer, Thomas Mikolajick, and Stefan Slesazeck. 2021. Ferroelectric transistors with asymmetric double gate for memory window exceeding 12 V and disturb-free read.Nanoscale 13, 38 (2021), 16258–16266
2021
-
[28]
Kai Ni, Xunzhao Yin, Ann Franchesca Laguna, Siddharth Joshi, Stefan Dünkel, Martin Trentzsch, Johannes Müller, Sven Beyer, Michael Niemier, Xiaobo Sharon Hu, and Suman Datta. 2019. Ferroelectric ternary content-addressable memory for one-shot learning.Nature Electronics2, 11 (2019), 521–529
2019
-
[29]
Olivier Nier, Denis Rideau, Yann-Michel Niquet, Frédéric Monsieur, Viet Hung Nguyen, François Triozon, Antoine Cros, Raphael Clerc, Jean-Christophe Barbé, Pierpaolo Palestri, David Esseni, Ivan Duchemin, Lee Smith, Luca Silvestri, Frédéric Nallet, Christine Tavernier, Hugues Jaouen, and Luca Selmi. 2013. Multi- scale strategy for high-k/metal-gate UTBB-FD...
2013
-
[30]
Xiaochen Peng, Shanshi Huang, Yandong Luo, Xiaoyu Sun, and Shimeng Yu
-
[31]
In2019 IEEE international electron devices meeting (IEDM)
DNN+ NeuroSim: An end-to-end benchmarking framework for compute-in- memory accelerators with versatile device technologies. In2019 IEEE international electron devices meeting (IEDM). IEEE, 32–5
-
[32]
Dayane Reis, Ann Franchesca Laguna, Michael Niemier, and Xiaobo Sharon Hu
-
[33]
InProceedings of the 26th Asia and South Pacific Design Automation Conference
Attention-in-memory for few-shot learning with configurable ferroelectric FET arrays. InProceedings of the 26th Asia and South Pacific Design Automation Conference. 49–54
-
[34]
Abu Sebastian, Manuel Le Gallo, Riduan Khaddam-Aljameh, and Evangelos Eleftheriou. 2020. Memory devices and applications for in-memory computing. Nature nanotechnology15, 7 (2020), 529–544
2020
-
[35]
Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars.ACM SIGARCH Computer Architecture News44, 3 (2016), 14–26
2016
-
[36]
Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. 2017. Pipelayer: A pipelined reram-based accelerator for deep learning. In2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 541–552
2017
-
[37]
Shrihari Sridharan, Jacob R Stevens, Kaushik Roy, and Anand Raghunathan. 2023. X-Former: In-memory acceleration of transformers.IEEE Transactions on Very Large Scale Integration (VLSI) Systems31, 8 (2023), 1223–1233
2023
-
[38]
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 3645–3650
2019
- [39]
-
[40]
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2022. Efficient Transformers: A Survey.ACM Comput. Surv.55, 6, Article 109 (Dec. 2022), 28 pages. 12
2022
-
[41]
S. Van Beek, K. Cai, F. Yasin, H. Hody, G. Talmelli, V. D. Nguyen, N. Franchina Vergel, A. Palomino, A. Trovato, K. Wostyn, S. Rao, G. S. Kar, and S. Couet. 2023. Scaling the SOT track – A path towards maximizing efficiency in SOT-MRAM. In2023 International Electron Devices Meeting (IEDM). IEEE, 1–4. https://doi.org/10.1109/IEDM45741.2023.10413749
-
[42]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. InAdvances in Neural Information Processing Systems (NeurIPS), Vol. 30
2017
-
[43]
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. InProceedings of the 2018 EMNLP Workshop Black- boxNLP: Analyzing and Interpreting Neural Networks for NLP. 353–355
2018
-
[44]
Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. 2020. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7675–7688
2020
-
[45]
Hanrui Wang, Zhekai Zhang, and Song Han. 2021. SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. In2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 97–110
2021
-
[46]
Chao Yang, Xiaoping Wang, and Zhigang Zeng. 2022. Full-circuit implementation of transformer network based on memristor.IEEE Transactions on Circuits and Systems I: Regular Papers69, 4 (2022), 1395–1407. https://doi.org/10.1109/TCSI. 2021.3136355
-
[47]
Xiaoxuan Yang, Bonan Yan, Hai Li, and Yiran Chen. 2020. ReTransformer: ReRAM- based processing-in-memory architecture for transformer acceleration. InPro- ceedings of the 39th International Conference on Computer-Aided Design. 1–9
2020
-
[48]
Xunzhao Yin, Xiaoming Chen, Michael Niemier, and Xiaobo Sharon Hu. 2018. Ferroelectric FETs-based nonvolatile logic-in-memory circuits.IEEE Transactions on Very Large Scale Integration (VLSI) Systems27, 1 (2018), 159–172
2018
-
[49]
Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, and Guangyu Sun. 2022. PTQ4ViT: Post-Training Quantization for Vision Transformers with Twin Uni- form Quantization. InEuropean Conference on Computer Vision (ECCV). 191–207
2022
-
[50]
Xuan Zhang, Zhuoran Song, Xing Li, Zhezhi He, Naifeng Jing, Li Jiang, and Xiaoyao Liang. 2024. Watt: A Write-Optimized RRAM-Based Accelerator for Attention. InEuropean Conference on Parallel Processing (Euro-Par). Springer, 111–125
2024
-
[51]
Minxuan Zhou, Weihong Xu, Jaeyoung Kang, and Tajana Rosing. 2022. Transpim: A memory-based acceleration via software-hardware co-design for transformer. In2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 1071–1085
2022
-
[52]
Zhe Zhou, Junlin Liu, Zhenyu Gu, and Guangyu Sun. 2022. Energon: Toward efficient acceleration of transformers using dynamic sparse attention.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems42, 1 (2022), 136–149. 13
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.