Tensor Memory Engine: On-the-fly Data Reorganization for Ideal Locality
Pith reviewed 2026-05-10 13:31 UTC · model grok-4.3
The pith
A Tensor Memory Engine placed in the CPU data path can deliver ideal cache locality by fetching data and presenting it in a reorganized layout.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper's central claim is that a Tensor Memory Engine can be added to commercially available SoC and FPGA platforms so that it accesses memory on behalf of the CPUs and composes a re-organized view of the memory layout, thereby supplying running applications with data that exhibits ideal cache locality while keeping all computation on the CPUs and clearly separating memory access from processing.
What carries the argument
The Tensor Memory Engine: a hardware component inserted in the CPUs' data path that fetches data from memory and supplies it to the processor in a reorganized layout chosen to maximize spatiotemporal cache locality.
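The reorganization the engine performs can be illustrated in software: a column-wise access over a row-major array is strided and touches a new cache line per element, whereas gathering the same elements into a contiguous buffer, as the TME would do on the fly in hardware, restores unit-stride access. A minimal Python sketch of that idea; the function name, shapes, and values are illustrative, not from the paper:

```python
# Software analogue of the TME's re-organized view: gather a strided
# access pattern into a contiguous buffer so the CPU sees unit-stride data.

def reorganized_view(flat, rows, cols, col):
    """Copy one column of a row-major (rows x cols) matrix into a
    contiguous list -- the layout a TME-like engine would present to
    the CPU in place of the strided original."""
    return [flat[r * cols + col] for r in range(rows)]

rows, cols = 4, 3
flat = [v * 10 for v in range(rows * cols)]  # row-major storage

# Strided addresses the CPU would otherwise issue (stride = cols elements):
strided_addrs = [r * cols + 1 for r in range(rows)]  # column 1: [1, 4, 7, 10]

# Contiguous view the engine supplies instead:
view = reorganized_view(flat, rows, cols, col=1)
print(view)  # [10, 40, 70, 100] -- same data, now unit stride
```

The CPU's loop body is unchanged in either case; only the layout it observes differs, which is the decoupling the paper emphasizes.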
Load-bearing premise
The Tensor Memory Engine can be built on existing SoC and FPGA hardware without adding new bottlenecks or breaking seamless integration with standard CPUs.
What would settle it
Measure cache miss rates and execution time for representative data-intensive workloads on an SoC or FPGA with the Tensor Memory Engine enabled versus disabled; if miss rates stay high or total performance drops, the locality benefit does not materialize.
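A software stand-in for that experiment: a toy direct-mapped cache model that counts misses for the raw strided traversal versus the same data visited contiguously, as a reorganized view would allow. The cache geometry (64-byte lines, 4 lines, 8-byte elements) and matrix size are illustrative assumptions, not parameters from the paper:

```python
# Toy direct-mapped cache model: compare miss rates for two access
# orders over the same data. All parameters are illustrative.

LINE = 64         # bytes per cache line
NLINES = 4        # number of lines in the direct-mapped cache
ELEM = 8          # bytes per element

def miss_rate(addresses):
    """Simulate a direct-mapped cache over a byte-address trace;
    return the fraction of accesses that miss."""
    tags = [None] * NLINES
    misses = 0
    for a in addresses:
        line = a // LINE
        idx = line % NLINES
        if tags[idx] != line:   # miss: fetch the line
            tags[idx] = line
            misses += 1
    return misses / len(addresses)

rows = cols = 64
# Column-wise traversal of a row-major matrix: large stride, poor locality.
strided = [(r * cols + c) * ELEM for c in range(cols) for r in range(rows)]
# Same elements after reorganization: unit stride, ideal locality.
contiguous = sorted(strided)

print(miss_rate(strided))     # 1.0  -- every access misses
print(miss_rate(contiguous))  # 0.125 -- one miss per 8-element line
```

On real hardware the analogous measurement would come from performance counters (e.g., cache-miss events) on workloads run with the engine enabled and disabled, which is exactly the falsification test stated above.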
Original abstract
The shift to data-intensive processing from the cloud to the edge has introduced new challenges and expectations for the next generation of intelligent computing systems. As the memory wall continues to grow, modern systems can only meet these performance expectations by displaying data access patterns that exhibit ideal layouts in memory and ideal spatiotemporal locality in caches. However, only a few data-intensive applications are characterized by ideal locality. Instead, most applications exhibit either (i) poor locality when naively implemented and must undergo costly redesigns and tuning or (ii) inflated memory footprint to offer proper locality. To address the aforementioned challenges, we propose a hardware/software co-designed approach that can be implemented on commercially available SoC/FPGA platforms. Our approach seamlessly inserts in the CPUs' data path a Tensor Memory Engine that provides data with an ideal cache locality to running applications by (i) accessing the memory on behalf of the CPUs and (ii) composing a re-organized view of the memory layout. Unlike in- and near-memory computing approaches, it sets itself apart by clearly decoupling computing and memory accesses; computation is still performed on CPUs while the data re-organization is delegated to the Tensor Memory Engine.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a hardware/software co-design called the Tensor Memory Engine (TME) that is inserted into the CPU data path on commercial SoC/FPGA platforms. The TME accesses memory on behalf of the CPUs and composes a re-organized view of the memory layout to deliver ideal cache locality to running applications, while all computation remains on the CPUs. This approach is positioned as distinct from in- and near-memory computing by decoupling memory reorganization from computation, aiming to address the memory wall for data-intensive edge and cloud workloads without requiring application redesigns or inflated memory footprints.
Significance. If the TME can be realized with the claimed seamless integration and locality benefits, it would represent a practical middle ground between conventional CPU-centric systems and emerging in-memory architectures, potentially reducing the performance gap for applications that currently suffer from poor locality. The explicit decoupling of concerns is a conceptual strength that could ease adoption on existing platforms. However, the complete absence of any implementation details, hardware description, simulation results, or performance data means the practical significance cannot yet be assessed.
Major comments (2)
- [Abstract] The central claim that the TME 'provides data with an ideal cache locality' and can be 'seamlessly' inserted without new bottlenecks is presented without any supporting quantitative model, latency analysis, or prototype results, leaving the load-bearing feasibility assertion unsubstantiated.
- [Proposed Approach] The manuscript provides no hardware architecture details (e.g., buffering strategy, address remapping logic, or integration with existing cache coherence protocols) for how on-the-fly reorganization is performed while maintaining correctness for arbitrary access patterns.
Minor comments (2)
- The repeated use of 'ideal locality' and 'ideal cache locality' would benefit from a precise definition (e.g., in terms of compulsory/capacity/conflict misses or a target hit-rate metric) to allow future evaluation.
- No references to prior work on memory-side accelerators or data-layout engines are visible in the provided text; adding a related-work section would help situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to clarify the conceptual nature of the TME proposal while strengthening the presentation.
Point-by-point responses
- Referee: [Abstract] The central claim that the TME 'provides data with an ideal cache locality' and can be 'seamlessly' inserted without new bottlenecks is presented without any supporting quantitative model, latency analysis, or prototype results, leaving the load-bearing feasibility assertion unsubstantiated.
Authors: We agree that the abstract presents the intended benefits at a high level without quantitative support. The manuscript is a conceptual hardware/software co-design proposal rather than an evaluated implementation. In the revised version, we will update the abstract to qualify the claims as design goals and add a dedicated discussion section providing a high-level analytical model of expected locality improvements and potential insertion overheads based on the described data-path integration. Revision: yes
- Referee: [Proposed Approach] The manuscript provides no hardware architecture details (e.g., buffering strategy, address remapping logic, or integration with existing cache coherence protocols) for how on-the-fly reorganization is performed while maintaining correctness for arbitrary access patterns.
Authors: The current manuscript emphasizes the system-level distinction of decoupling reorganization from computation. We acknowledge the absence of low-level hardware specifics. We will expand the Proposed Approach section with additional high-level architectural descriptions covering buffering concepts for on-the-fly access, logical address remapping to support arbitrary patterns, and high-level considerations for maintaining coherence and correctness through software-managed mappings. Revision: yes
- Not addressed in the revision: providing concrete prototype results, full hardware descriptions, or simulation data, as the work remains at the conceptual proposal stage without an implemented or simulated prototype.
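The "logical address remapping" the authors promise to describe can be sketched in software, purely as an illustration since the paper gives no hardware details: a remap function translates a logical, tile-major index into the physical row-major offset of the same element, so the CPU iterates 0, 1, 2, ... while the engine gathers a tiled layout. The tile size and matrix shape below are illustrative assumptions:

```python
# Hypothetical logical-to-physical remap: the CPU walks logical indices
# 0, 1, 2, ... while an engine fetches a tiled view of a row-major matrix.
# Shapes and tile size are illustrative, not taken from the paper.

def tile_remap(logical, rows, cols, t):
    """Map a logical index (t x t tile-major order) to the physical
    row-major offset of the same element in a rows x cols matrix."""
    per_tile = t * t
    tiles_per_row = cols // t
    tile, offset = divmod(logical, per_tile)
    tr, tc = divmod(tile, tiles_per_row)   # tile row / tile column
    r, c = divmod(offset, t)               # position within the tile
    return (tr * t + r) * cols + (tc * t + c)

physical = [tile_remap(i, 4, 4, 2) for i in range(16)]
print(physical)
# [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

A hardware engine would evaluate such a mapping per request in the data path; correctness for arbitrary patterns and coherence with CPU-side writes are exactly the open questions the referee raises.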
Circularity Check
No significant circularity detected in the Tensor Memory Engine proposal
Full rationale
The paper presents a hardware/software co-design proposal for inserting a Tensor Memory Engine into the CPU data path to perform on-the-fly data reorganization while keeping computation on the CPU. No mathematical derivations, equations, parameter fittings, or self-citations appear in the provided text that would reduce any claim to its inputs by construction. The central argument explicitly decouples computing from memory accesses, targets commercial SoC/FPGA platforms, and contrasts with in/near-memory computing without invoking uniqueness theorems, ansatzes, or fitted predictions. The approach is self-contained as a descriptive design proposal rather than a closed-form derivation chain.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Modern systems require ideal data layouts in memory and ideal spatiotemporal locality in caches for performance.
Invented entities (1)
- Tensor Memory Engine (no independent evidence)