FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture
Pith reviewed 2026-05-07 14:32 UTC · model grok-4.3
The pith
FusionCIM fuses attention operations inside compute-in-memory hardware to deliver up to 3.86 times lower energy for large language model inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FusionCIM is an operator-fusion-driven compute-in-memory accelerator for LLM inference built around three linked mechanisms: a hybrid pipeline that assigns QK^T matrix work to inner-product CIM units and PV aggregation to outer-product CIM units, a QO-stationary dataflow that removes repeated KV loads and transpose-related buffer accesses, and a pattern-aware online-softmax that exploits score distribution regularities to lower exponential rescaling cost. When evaluated on LLaMA-3, the design reports up to 3.86 times energy reduction and 1.98 times speedup against prior state-of-the-art CIM accelerators while reaching 29.4 TOPS/W at the full system level.
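The excerpt does not spell out the pattern-aware mechanism itself, but the rescaling cost it targets comes from the standard online-softmax used in fused attention. The sketch below is a generic, FlashAttention-style single-query version in Python, not the paper's hardware algorithm; the block size and tensor shapes are illustrative assumptions.

```python
import numpy as np

def fused_attention_online_softmax(q, K, V, block=64):
    """Single-query fused attention with an online softmax (FlashAttention-style).

    q: (d,) query vector; K, V: (n, d) key/value matrices.
    Scores are processed block by block, so the full score row is never
    materialized. Every new block forces an exponential rescale of the
    running accumulator -- the overhead the pattern-aware softmax targets.
    """
    d = q.shape[0]
    m = -np.inf                                   # running max of scores
    l = 0.0                                       # running softmax denominator
    acc = np.zeros(V.shape[1])                    # running weighted sum of V rows

    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = (k_blk @ q) / np.sqrt(d)              # QK^T scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                 # rescale the old accumulator
        p = np.exp(s - m_new)                     # unnormalized block probabilities
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_blk             # PV aggregation for this block
        m = m_new
    return acc / l

# Sanity check against the direct softmax(QK^T / sqrt(d)) @ V computation.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(256, 64)), rng.normal(size=(256, 64))
s = (K @ q) / np.sqrt(64)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(fused_attention_online_softmax(q, K, V), ref)
```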
What carries the argument
The hybrid CIM pipeline that maps QK^T computation onto inner-product units and PV aggregation onto outer-product units, paired with QO-stationary dataflow to keep matrix data on chip during fusion.
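The review excerpt does not describe the IP-CIM and OP-CIM macros themselves, so the following is only a numeric illustration of the arithmetic split the mapping relies on: each QK^T score is a single dot product (an inner-product reduction), while the PV output can be accumulated from rank-1 outer-product contributions, letting softmax probabilities be consumed as soon as they are produced. Shapes and values are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 64, 128                        # head dim and sequence length (illustrative)
Q = rng.normal(size=(1, d))           # one query row
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))

# Inner-product view of QK^T: every attention score is one dot product,
# the natural shape for a CIM array that reduces along stored K rows.
S = np.array([[Q[i] @ K[j] for j in range(n)] for i in range(Q.shape[0])])

# Softmax over the score row (computed in full here, as a reference).
P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)

# Outer-product view of PV: the output is built up from rank-1 updates
# p[:, j] * V[j], so each probability can be applied as soon as it exists
# instead of waiting for the whole P row.
O = np.zeros((Q.shape[0], d))
for j in range(n):
    O += np.outer(P[:, j], V[j])

assert np.allclose(S, Q @ K.T) and np.allclose(O, P @ V)
```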
If this is right
- Attention-heavy layers in transformers can be executed with far higher on-chip data reuse than in conventional CIM designs.
- Matrix-multiplication fusion across QK^T and PV stages becomes practical inside a single memory array pipeline.
- Nonlinear operations such as softmax can be simplified by exploiting statistical regularities in attention scores rather than computing every exponential exactly.
- System-level efficiency for autoregressive decoding improves when KV cache movement is minimized by the stationary dataflow (a rough arithmetic sketch follows this list).
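As a rough sense of scale for the last point, the arithmetic below estimates KV-cache traffic for a decoder that re-reads keys and values every step. The configuration (32 layers, 8 grouped-query KV heads, head dimension 128, FP16) matches the commonly reported LLaMA-3-8B layout but is an assumption here, not a figure from the paper, and the re-read model is deliberately naive.

```python
# Illustrative back-of-the-envelope arithmetic (not from the paper): bytes of
# KV cache touched when keys/values are re-read from a buffer at every decode
# step, assuming a LLaMA-3-8B-like configuration stored in FP16.
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
context = 8192                                    # assumed context length

# Without reuse, decode step t re-reads roughly t tokens' worth of K and V;
# a stationary dataflow that keeps operands resident avoids most of this traffic.
naive_traffic = sum(t * kv_bytes_per_token for t in range(1, context + 1))

print(f"KV cache per token: {kv_bytes_per_token / 2**10:.0f} KiB")
print(f"Naive re-read traffic over {context} steps: {naive_traffic / 2**40:.1f} TiB")
```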
Where Pith is reading between the lines
- The same fusion pattern could be tested on other transformer variants that share the same QKV attention structure.
- Chip designers working on general-purpose accelerators might borrow the distribution-aware softmax reduction for low-precision workloads.
- If the simulated efficiency holds in silicon, the architecture could lower the power budget needed for on-device LLM inference enough to support longer context windows.
- Integration with existing high-bandwidth memory stacks would need to preserve the stationary dataflow benefits without adding new interface stalls.
Load-bearing premise
The reported speedups and energy savings are produced by simulation models that assume the three fusion techniques incur no hidden hardware overheads when built in real silicon.
What would settle it
Fabricate a test chip implementing the hybrid pipeline, QO-stationary dataflow, and pattern-aware softmax, then measure its actual energy per token and latency for LLaMA-3 inference and compare the numbers to the simulation predictions.
Original abstract
In this paper, we propose FusionCIM, an operator-fusion-driven compute-in-memory (CIM) accelerator architecture for efficient and scalable LLM inference, with three key innovations: (1) a hybrid CIM pipeline architecture that maps QK^T computation on inner-product-based CIM (IP-CIM) and PV aggregation on outer-product-based CIM (OP-CIM) for efficient matrix multiplications fusion; (2) a QO-stationary dataflow that eliminates repeated KV loading in CIM and K-matrix access in buffer under transpose fusion, significantly improving data reuse on chip; and (3) a pattern-aware online-softmax mechanism that exploits distribution regularities of attention scores to reduce exponential rescaling overhead for non-linear fusion. Experimental results on LLaMA-3 model show that FusionCIM achieves up to 3.86x energy saving, and 1.98x speedup compared with prior SOTA CIM-based designs with 29.4 TOPS/W energy efficiency at the system level.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FusionCIM, a fusion-driven computing-in-memory architecture for LLM inference. It introduces three innovations: (1) a hybrid pipeline mapping QK^T to inner-product CIM (IP-CIM) and PV to outer-product CIM (OP-CIM) for matrix-multiplication fusion; (2) a QO-stationary dataflow that eliminates repeated KV-cache loads and K-matrix buffer accesses under transpose fusion; and (3) a pattern-aware online-softmax that exploits attention-score distribution regularities to cut exponential rescaling costs. On LLaMA-3, the design is reported to deliver up to 3.86× energy savings and 1.98× speedup versus prior SOTA CIM accelerators while achieving 29.4 TOPS/W system-level efficiency.
Significance. If the modeled gains prove robust, the work offers a concrete template for exploiting operator fusion inside CIM arrays to reduce off-chip and on-chip data movement in attention layers, a dominant bottleneck for LLM inference. The explicit separation of IP-CIM and OP-CIM roles plus the stationary dataflow provide reusable ideas for future CIM designs targeting transformers.
major comments (2)
- [Evaluation] Evaluation section: the headline claims (3.86× energy, 1.98× speedup, 29.4 TOPS/W) rest on architectural simulation; the manuscript provides no equations or tables that quantify the modeled energy of control logic, analog non-idealities, or residual buffer traffic after QO-stationary fusion, making it impossible to verify that the reported gains survive realistic hardware effects.
- [§3 and §4] §3 (Hybrid CIM Pipeline) and §4 (QO-stationary Dataflow): the central claim that the hybrid mapping plus QO-stationary flow “eliminates repeated KV loading” is load-bearing for the speedup numbers, yet no cycle-accurate breakdown or sensitivity study shows the fraction of energy saved by each mechanism versus baseline CIM designs.
minor comments (2)
- [Abstract] Abstract: the phrase “experimental results” should be qualified as “cycle-accurate architectural simulation” to avoid implying silicon measurements.
- All result tables should include absolute baseline numbers (energy, latency, TOPS/W) alongside the reported speedups so readers can recompute the ratios.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and commit to revisions that strengthen the evaluation rigor without altering the core claims.
Point-by-point responses
Referee: [Evaluation] Evaluation section: the headline claims (3.86× energy, 1.98× speedup, 29.4 TOPS/W) rest on architectural simulation; the manuscript provides no equations or tables that quantify the modeled energy of control logic, analog non-idealities, or residual buffer traffic after QO-stationary fusion, making it impossible to verify that the reported gains survive realistic hardware effects.
Authors: We acknowledge that the current manuscript presents results from architectural simulation without explicit equations or tables breaking down control logic energy, analog non-idealities, and post-fusion residual buffer traffic. These components are modeled at a system level using standard parameters from prior CIM literature, but we agree that greater transparency is needed. In the revised version we will add a new subsection detailing the energy model equations for control overhead and buffer traffic, a table of component-wise energy contributions, and a discussion of non-ideality sensitivity drawn from published CIM characterizations. This will enable readers to assess robustness under realistic effects.
revision: yes
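The component-wise decomposition the authors promise is not given in the excerpt, so the sketch below only illustrates the general shape such a model could take: total energy as a weighted sum of MAC, buffer, and control terms. Every per-access energy and operation count is a placeholder, not a value from the paper or its references.

```python
from dataclasses import dataclass

@dataclass
class EnergyModel:
    """Hypothetical component-wise energy model; all parameters are placeholders."""
    e_cim_mac_pj: float          # energy per CIM multiply-accumulate (pJ)
    e_buf_byte_pj: float         # energy per byte of on-chip buffer traffic (pJ)
    e_ctrl_cycle_pj: float       # control/sequencing overhead per cycle (pJ)

    def total_uj(self, macs: float, buffer_bytes: float, cycles: float) -> float:
        """E_total = E_mac*N_mac + E_buf*N_bytes + E_ctrl*N_cycles, in microjoules."""
        pj = (self.e_cim_mac_pj * macs
              + self.e_buf_byte_pj * buffer_bytes
              + self.e_ctrl_cycle_pj * cycles)
        return pj * 1e-6

# Made-up comparison: a baseline with heavy KV buffer re-reads versus a fused
# dataflow that removes most of that traffic. All counts are illustrative only.
model = EnergyModel(e_cim_mac_pj=0.05, e_buf_byte_pj=1.2, e_ctrl_cycle_pj=8.0)
baseline = model.total_uj(macs=2.1e9, buffer_bytes=6.0e8, cycles=1.5e6)
fused = model.total_uj(macs=2.1e9, buffer_bytes=2.0e8, cycles=1.2e6)
print(f"baseline {baseline:.0f} uJ, fused {fused:.0f} uJ, ratio {baseline / fused:.2f}x")
```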
Referee: [§3 and §4] §3 (Hybrid CIM Pipeline) and §4 (QO-stationary Dataflow): the central claim that the hybrid mapping plus QO-stationary flow “eliminates repeated KV loading” is load-bearing for the speedup numbers, yet no cycle-accurate breakdown or sensitivity study shows the fraction of energy saved by each mechanism versus baseline CIM designs.
Authors: The hybrid IP-CIM/OP-CIM mapping combined with QO-stationary dataflow is designed to keep Q and O activations resident in the arrays, thereby removing repeated KV-cache loads and K-matrix buffer accesses under transpose. While the manuscript reports aggregate gains versus prior SOTA, we concur that isolating the contribution of each technique would strengthen the paper. We will incorporate a cycle-accurate energy breakdown table and sensitivity analysis in the evaluation section of the revised manuscript, showing the fractional savings from the hybrid pipeline, QO-stationary flow, and pattern-aware softmax relative to baseline CIM designs.
revision: yes
Circularity Check
No circularity: performance claims rest on simulation of the described architecture, not on self-referential definitions or fitted inputs.
Full rationale
The paper describes a hybrid IP-CIM/OP-CIM pipeline, QO-stationary dataflow, and pattern-aware online-softmax as architectural innovations, then reports simulated speedups and energy savings on LLaMA-3. No equations, fitted parameters, or derivation chains appear in the provided text that would reduce the claimed 3.86x energy saving or 29.4 TOPS/W to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems. The results are presented as outcomes of the proposed design choices under simulation assumptions, which remain externally falsifiable and do not collapse into renaming or self-definition.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard assumptions in CIM hardware design, such as ideal memory behavior and negligible interconnect overhead.
Reference graph
Works this paper leans on
- [1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
- [2] S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua, "NExT-GPT: Any-to-any multimodal LLM," in Forty-first International Conference on Machine Learning, 2024.
- [3] Z. Yu, S. Liang, T. Ma, Y. Cai, Z. Nan, D. Huang, X. Song, Y. Hao, J. Zhang, T. Zhi et al., "Cambricon-LLM: A chiplet-based hybrid architecture for on-device inference of 70B LLM," in 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024, pp. 1474–1488.
- [4] J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky, "NVIDIA A100 tensor core GPU: Performance and innovation," IEEE Micro, vol. 41, no. 2, pp. 29–35, 2021.
- [5] T. Norrie, N. Patil, D. H. Yoon, G. Kurian, S. Li, J. Laudon, C. Young, N. Jouppi, and D. Patterson, "The design process for Google's training chips: TPUv2 and TPUv3," IEEE Micro, vol. 41, no. 2, pp. 56–63, 2021.
- [6] Y.-D. Chih, P.-H. Lee, H. Fujiwara, Y.-C. Shih, C.-F. Lee, R. Naous, Y.-L. Chen, C.-P. Lo, C.-H. Lu, H. Mori et al., "16.4 An 89TOPS/W and 16.3TOPS/mm² all-digital SRAM-based full-precision compute-in-memory macro in 22nm for machine-learning edge applications," in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, 2021.
- [7] Z. Xuan, C. Liu, Y. Zhang, Y. Li, and Y. Kang, "A brain-inspired ADC-free SRAM-based in-memory computing macro with high-precision MAC for AI application," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 70, no. 4, pp. 1276–1280, 2022.
- [8] Y. Leviathan, M. Kalman, and Y. Matias, "Fast inference from transformers via speculative decoding," in International Conference on Machine Learning, PMLR, 2023, pp. 19274–19286.
- [9] F. Tu, Z. Wu, Y. Wang, L. Liang, L. Liu, Y. Ding, L. Liu, S. Wei, Y. Xie, and S. Yin, "TranCIM: Full-digital bitline-transpose CIM-based sparse transformer accelerator with pipeline/parallel reconfigurable modes," IEEE Journal of Solid-State Circuits, vol. 58, no. 6, pp. 1798–1809, 2022.
- [10] F. Tu, Y. Wang, Z. Wu, W. Wu, L. Liu, Y. Hu, S. Wei, and S. Yin, "16.4 TensorCIM: A 28nm 3.7nJ/gather and 8.3TFLOPS/W FP32 digital-CIM tensor processor for MCM-CIM-based beyond-NN acceleration," in 2023 IEEE International Solid-State Circuits Conference (ISSCC), 2023, pp. 254–256.
- [11] X. Fu, Q. Ren, H. Wu, F. Xiang, Q. Luo, J. Yue, Y. Chen, and F. Zhang, "P³ViT: A CIM-based high-utilization architecture with dynamic pruning and two-way ping-pong macro for vision transformer," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 70, no. 12, pp. 4938–4948, 2023.
- [12] S. Wang, Z. Li, Y. Ma, and Y. Kang, "SysCIM: A heterogeneous chip architecture for high-efficiency CNN training at edge," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2025.
- [13] J. Park, K. Lee, and J. Park, "TP-DCIM: Transposable digital SRAM CIM architecture for energy-efficient and high throughput transformer acceleration," in Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, 2024, pp. 1–8.
- [14] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, "FlashAttention: Fast and memory-efficient exact attention with IO-awareness," Advances in Neural Information Processing Systems, vol. 35, pp. 16344–16359, 2022.
- [15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
- [16] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., "DeepSeek-V3 technical report," arXiv preprint arXiv:2412.19437, 2024.
- [17] F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve, "Better & faster large language models via multi-token prediction," arXiv preprint arXiv:2404.19737, 2024.
- [18] J. Chen, F. Tu, K. Shao, F. Tian, X. Huo, C.-Y. Tsui, and K.-T. Cheng, "AutoDCIM: An automated digital CIM compiler," in 2023 60th ACM/IEEE Design Automation Conference (DAC), 2023, pp. 1–6.
- [19] X. Hu, H. Mun, J. Meng, Y. Liao, A. Sridharan, and J.-S. Seo, "A 28nm 20.9-137.2 TOPS/W output-stationary SRAM compute-in-memory macro featuring dynamic look-ahead zero weight skipping and runtime partial sum quantization," in 2025 IEEE Custom Integrated Circuits Conference (CICC), 2025, pp. 1–3.
- [20] X. Peng, S. Huang, Y. Luo, X. Sun, and S. Yu, "DNN+NeuroSim: An end-to-end benchmarking framework for compute-in-memory accelerators with versatile device technologies," in 2019 IEEE International Electron Devices Meeting (IEDM), 2019.
- [21] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, "CACTI 7: New tools for interconnect exploration in innovative off-chip memories," ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 1–25, 2017.
- [22] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., "The Llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024.