Tensor Memory Engine: On-the-fly Data Reorganization for Ideal Locality
Pith reviewed 2026-05-10 13:31 UTC · model grok-4.3
The pith
A Tensor Memory Engine placed in the CPU data path can deliver ideal cache locality by fetching data and presenting it in a reorganized layout.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper's central claim is that a Tensor Memory Engine can be added to commercially available SoC and FPGA platforms so that it accesses memory on behalf of the CPUs and composes a re-organized view of the memory layout, thereby supplying running applications with data that exhibits ideal cache locality while keeping all computation on the CPUs and clearly separating memory access from processing.
What carries the argument
The Tensor Memory Engine: a hardware component inserted in the CPUs' data path that fetches data from memory and supplies it to the processor in a reorganized layout chosen to maximize spatiotemporal cache locality.
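The reorganization the engine performs can be illustrated in software: a column-wise access over a row-major array is strided and touches a new cache line per element, whereas gathering the same elements into a contiguous buffer, as the TME would do on the fly in hardware, restores unit-stride access. A minimal Python sketch of that idea; the function name, shapes, and values are illustrative, not from the paper:

```python
# Software analogue of the TME's re-organized view: gather a strided
# access pattern into a contiguous buffer so the CPU sees unit-stride data.

def reorganized_view(flat, rows, cols, col):
    """Copy one column of a row-major (rows x cols) matrix into a
    contiguous list -- the layout a TME-like engine would present to
    the CPU in place of the strided original."""
    return [flat[r * cols + col] for r in range(rows)]

rows, cols = 4, 3
flat = [v * 10 for v in range(rows * cols)]  # row-major storage

# Strided addresses the CPU would otherwise issue (stride = cols elements):
strided_addrs = [r * cols + 1 for r in range(rows)]  # column 1: [1, 4, 7, 10]

# Contiguous view the engine supplies instead:
view = reorganized_view(flat, rows, cols, col=1)
print(view)  # [10, 40, 70, 100] -- same data, now unit stride
```

The CPU's loop body is unchanged in either case; only the layout it observes differs, which is the decoupling the paper emphasizes.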
Load-bearing premise
The Tensor Memory Engine can be built on existing SoC and FPGA hardware without adding new bottlenecks or breaking seamless integration with standard CPUs.
What would settle it
Measure cache miss rates and execution time for representative data-intensive workloads on an SoC or FPGA with the Tensor Memory Engine enabled versus disabled; if miss rates stay high or total performance drops, the locality benefit does not materialize.
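A software stand-in for that experiment: a toy direct-mapped cache model that counts misses for the raw strided traversal versus the same data visited contiguously, as a reorganized view would allow. The cache geometry (64-byte lines, 4 lines, 8-byte elements) and matrix size are illustrative assumptions, not parameters from the paper:

```python
# Toy direct-mapped cache model: compare miss rates for two access
# orders over the same data. All parameters are illustrative.

LINE = 64         # bytes per cache line
NLINES = 4        # number of lines in the direct-mapped cache
ELEM = 8          # bytes per element

def miss_rate(addresses):
    """Simulate a direct-mapped cache over a byte-address trace;
    return the fraction of accesses that miss."""
    tags = [None] * NLINES
    misses = 0
    for a in addresses:
        line = a // LINE
        idx = line % NLINES
        if tags[idx] != line:   # miss: fetch the line
            tags[idx] = line
            misses += 1
    return misses / len(addresses)

rows = cols = 64
# Column-wise traversal of a row-major matrix: large stride, poor locality.
strided = [(r * cols + c) * ELEM for c in range(cols) for r in range(rows)]
# Same elements after reorganization: unit stride, ideal locality.
contiguous = sorted(strided)

print(miss_rate(strided))     # 1.0  -- every access misses
print(miss_rate(contiguous))  # 0.125 -- one miss per 8-element line
```

On real hardware the analogous measurement would come from performance counters (e.g., cache-miss events) on workloads run with the engine enabled and disabled, which is exactly the falsification test stated above.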
Original abstract
The shift to data-intensive processing from the cloud to the edge has introduced new challenges and expectations for the next generation of intelligent computing systems. As the memory wall continues to grow, modern systems can only meet these performance expectations by displaying data access patterns that exhibit ideal layouts in memory and ideal spatiotemporal locality in caches. However, only a few data-intensive applications are characterized by ideal locality. Instead, most applications exhibit either (i) poor locality when naively implemented and must undergo costly redesigns and tuning or (ii) inflated memory footprint to offer proper locality. To address the aforementioned challenges, we propose a hardware/software co-designed approach that can be implemented on commercially available SoC/FPGA platforms. Our approach seamlessly inserts in the CPUs' data path a Tensor Memory Engine that provides data with an ideal cache locality to running applications by (i) accessing the memory on behalf of the CPUs and (ii) composing a re-organized view of the memory layout. Unlike in- and near-memory computing approaches, it sets itself apart by clearly decoupling computing and memory accesses; computation is still performed on CPUs while the data re-organization is delegated to the Tensor Memory Engine.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a hardware/software co-design called the Tensor Memory Engine (TME) that is inserted into the CPU data path on commercial SoC/FPGA platforms. The TME accesses memory on behalf of the CPUs and composes a re-organized view of the memory layout to deliver ideal cache locality to running applications, while all computation remains on the CPUs. This approach is positioned as distinct from in- and near-memory computing by decoupling memory reorganization from computation, aiming to address the memory wall for data-intensive edge and cloud workloads without requiring application redesigns or inflated memory footprints.
Significance. If the TME can be realized with the claimed seamless integration and locality benefits, it would represent a practical middle ground between conventional CPU-centric systems and emerging in-memory architectures, potentially reducing the performance gap for applications that currently suffer from poor locality. The explicit decoupling of concerns is a conceptual strength that could ease adoption on existing platforms. However, the complete absence of any implementation details, hardware description, simulation results, or performance data means the practical significance cannot yet be assessed.
Major comments (2)
- [Abstract] The central claim that the TME 'provides data with an ideal cache locality' and can be 'seamlessly' inserted without new bottlenecks is presented without any supporting quantitative model, latency analysis, or prototype results, leaving the load-bearing feasibility assertion unsubstantiated.
- [Proposed Approach] The manuscript provides no hardware architecture details (e.g., buffering strategy, address remapping logic, or integration with existing cache coherence protocols) for how on-the-fly reorganization is performed while maintaining correctness for arbitrary access patterns.
Minor comments (2)
- The repeated use of 'ideal locality' and 'ideal cache locality' would benefit from a precise definition (e.g., in terms of compulsory/capacity/conflict misses or a target hit-rate metric) to allow future evaluation.
- No references to prior work on memory-side accelerators or data-layout engines are visible in the provided text; adding a related-work section would help situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to clarify the conceptual nature of the TME proposal while strengthening the presentation.
Point-by-point responses
- Referee: [Abstract] The central claim that the TME 'provides data with an ideal cache locality' and can be 'seamlessly' inserted without new bottlenecks is presented without any supporting quantitative model, latency analysis, or prototype results, leaving the load-bearing feasibility assertion unsubstantiated.
Authors: We agree that the abstract presents the intended benefits at a high level without quantitative support. The manuscript is a conceptual hardware/software co-design proposal rather than an evaluated implementation. In the revised version, we will update the abstract to qualify the claims as design goals and add a dedicated discussion section providing a high-level analytical model of expected locality improvements and potential insertion overheads based on the described data-path integration. Revision: yes
- Referee: [Proposed Approach] The manuscript provides no hardware architecture details (e.g., buffering strategy, address remapping logic, or integration with existing cache coherence protocols) for how on-the-fly reorganization is performed while maintaining correctness for arbitrary access patterns.
Authors: The current manuscript emphasizes the system-level distinction of decoupling reorganization from computation. We acknowledge the absence of low-level hardware specifics. We will expand the Proposed Approach section with additional high-level architectural descriptions covering buffering concepts for on-the-fly access, logical address remapping to support arbitrary patterns, and high-level considerations for maintaining coherence and correctness through software-managed mappings. Revision: yes
- Not addressed in the revision: providing concrete prototype results, full hardware descriptions, or simulation data, as the work remains at the conceptual proposal stage without an implemented or simulated prototype.
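The "logical address remapping" the authors promise to describe can be sketched in software, purely as an illustration since the paper gives no hardware details: a remap function translates a logical, tile-major index into the physical row-major offset of the same element, so the CPU iterates 0, 1, 2, ... while the engine gathers a tiled layout. The tile size and matrix shape below are illustrative assumptions:

```python
# Hypothetical logical-to-physical remap: the CPU walks logical indices
# 0, 1, 2, ... while an engine fetches a tiled view of a row-major matrix.
# Shapes and tile size are illustrative, not taken from the paper.

def tile_remap(logical, rows, cols, t):
    """Map a logical index (t x t tile-major order) to the physical
    row-major offset of the same element in a rows x cols matrix."""
    per_tile = t * t
    tiles_per_row = cols // t
    tile, offset = divmod(logical, per_tile)
    tr, tc = divmod(tile, tiles_per_row)   # tile row / tile column
    r, c = divmod(offset, t)               # position within the tile
    return (tr * t + r) * cols + (tc * t + c)

physical = [tile_remap(i, 4, 4, 2) for i in range(16)]
print(physical)
# [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

A hardware engine would evaluate such a mapping per request in the data path; correctness for arbitrary patterns and coherence with CPU-side writes are exactly the open questions the referee raises.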
Circularity Check
No significant circularity detected in the Tensor Memory Engine proposal
Full rationale
The paper presents a hardware/software co-design proposal for inserting a Tensor Memory Engine into the CPU data path to perform on-the-fly data reorganization while keeping computation on the CPU. No mathematical derivations, equations, parameter fittings, or self-citations appear in the provided text that would reduce any claim to its inputs by construction. The central argument explicitly decouples computing from memory accesses, targets commercial SoC/FPGA platforms, and contrasts with in/near-memory computing without invoking uniqueness theorems, ansatzes, or fitted predictions. The approach is self-contained as a descriptive design proposal rather than a closed-form derivation chain.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Modern systems require ideal data layouts in memory and ideal spatiotemporal locality in caches for performance.
Invented entities (1)
- Tensor Memory Engine (no independent evidence)