pith. machine review for the scientific record.

arxiv: 2605.05496 · v1 · submitted 2026-05-06 · 💻 cs.AR

Recognition: unknown

DICE: Enabling Efficient General-Purpose SIMT Execution with Statically Scheduled Coarse-Grained Reconfigurable Arrays

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 15:25 UTC · model grok-4.3

classification 💻 cs.AR
keywords: CGRA · SIMT · GPU architecture · energy efficiency · static scheduling · reconfigurable arrays · register file · spatial pipelines

The pith

A GPU design using statically scheduled coarse-grained reconfigurable arrays achieves 1.77 to 1.90 times the dynamic energy efficiency of conventional SIMD processors while maintaining comparable performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that the energy GPUs waste on frequent register file accesses and complex control logic can be greatly reduced by replacing their SIMD execution units with coarse-grained reconfigurable arrays that use static scheduling. Threads are dispatched in a pipelined fashion so that data flows directly between processing elements instead of being stored and retrieved repeatedly. Dynamic behavior such as variable-latency memory accesses or data-dependent branches is handled by breaking the program into p-graphs that span multiple configurations. Optimizations such as double-buffered configuration memory for quick reconfiguration and a unit that merges memory requests from successive threads help hide the overheads. A sympathetic reader would care because this suggests a path to more power-efficient parallel hardware that still supports the familiar SIMT programming model without losing speed.
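
To see the mechanism in miniature, here is an editorial sketch (not the paper's model) counting register file traffic for one small expression under the two execution styles; the graph, counts, and helper names are invented for illustration:

```python
# Editorial sketch: count register file (RF) accesses for d = (a+b) * (a-c).
# On a SIMD pipeline every operand read and result write goes through the RF;
# on a statically scheduled CGRA, values on internal dataflow edges travel
# PE-to-PE and never touch it.
ops = {
    "t1": ("add", ["a", "b"]),
    "t2": ("sub", ["a", "c"]),
    "d":  ("mul", ["t1", "t2"]),
}
live_out = {"d"}  # only 'd' must be written back

def rf_accesses_simd(ops):
    reads = sum(len(ins) for _, ins in ops.values())
    writes = len(ops)                      # each op writes its result
    return reads + writes

def rf_accesses_cgra(ops, live_out):
    internal = set(ops)                    # values produced inside the fabric
    reads = sum(1 for _, ins in ops.values()
                for v in ins if v not in internal)
    writes = len(live_out)                 # only live-outs reach the RF
    return reads + writes

print(rf_accesses_simd(ops))               # 9
print(rf_accesses_cgra(ops, live_out))     # 5, i.e. ~44% fewer on this toy
```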

Core claim

DICE replaces the SIMD backend of GPUs with minimal-overhead statically scheduled CGRAs. Active threads are dispatched in pipelined fashion onto the CGRA fabric, where data flows directly between PEs to reduce RF accesses for intermediate values. Programs with runtime dynamism are compiled into p-graphs by partitioning dynamic dependence edges across separate configurations. Double-buffered configuration memory hides reconfiguration latency, compile-time unrolling boosts utilization, and a temporal memory coalescing unit merges requests from pipelined threads. On standard benchmarks this cuts register file accesses by 68 percent on average, delivering 1.77-1.90x dynamic energy efficiency and 42.0-45.9 percent average power reduction versus modeled NVIDIA Turing SMs, while the full system performs comparably to the modeled GPU baselines.
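
Because DICE dispatches a warp's threads over successive cycles rather than in lockstep, coalescing must happen across time rather than across lanes. A minimal sketch of that idea, assuming a 128-byte line and a fixed merge window; the paper's TMCU internals may differ:

```python
# Minimal sketch of temporal coalescing (assumed parameters, not the TMCU's
# actual design): per-thread requests arrive on successive cycles; a small
# window merges those falling in the same cache line into one transaction.
LINE = 128   # bytes per cache line (assumption)

def coalesce(addresses, window=8):
    """Collapse byte addresses, seen in thread-dispatch order, into
    line-granularity transactions within each window of requests."""
    transactions = []
    for i in range(0, len(addresses), window):
        lines = {addr // LINE for addr in addresses[i:i + window]}
        transactions.extend(sorted(lines))
    return transactions

# 16 pipelined threads streaming consecutive 4-byte elements:
addrs = [0x1000 + 4 * t for t in range(16)]
print(len(addrs), "requests ->", len(coalesce(addrs)), "transactions")
# 16 requests -> 2 transactions (one merged line access per window)
```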

What carries the argument

The p-graph, which partitions dynamic dependence edges across separate CGRA configurations to support static scheduling for operations with runtime variability such as memory loads and control flow.
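
A schematic rendering of that partitioning step, as an editorial sketch rather than the authors' compiler: fixed-latency dependences may stay inside a partition, while every dynamic dependence forces a cut, leaving pieces that can each be statically scheduled onto one CGRA configuration.

```python
# Editorial sketch of p-graph formation: union nodes across static edges,
# cut at dynamic ones (variable-latency loads, data-dependent branches).
edges = [
    ("addr", "load",  False),   # (producer, consumer, is_dynamic)
    ("load", "mul",   True),    # load latency unknown at compile time
    ("mul",  "cmp",   False),
    ("cmp",  "store", True),    # data-dependent control decision
]

def partition(edges):
    parent = {}
    def find(x):                # union-find with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for src, dst, dynamic in edges:
        if not dynamic:         # static edges may stay inside one p-graph
            parent[find(src)] = find(dst)
    groups = {}
    for node in {n for e in edges for n in e[:2]}:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

for i, pg in enumerate(partition(edges)):
    print(f"p-graph {i}: {sorted(pg)}")
# Three partitions: {addr, load}, {cmp, mul}, {store}; each is statically
# schedulable, and the dynamic edges cross between configurations.
```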

If this is right

  • Register file accesses decrease by 68 percent on average.
  • CGRA-based processors reach a geometric mean of 1.77-1.90 times the dynamic energy efficiency of conventional streaming multiprocessors.
  • Average power is reduced by 42.0 to 45.9 percent, consistent with the efficiency figures (see the check after this list).
  • The complete system delivers performance levels similar to those of traditional GPU designs.
  • Spatial pipeline execution on reconfigurable hardware supports general-purpose SIMT workloads effectively.
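
The second and third bullets are linked: since energy is power times time, equal runtimes would make the power reduction exactly the reciprocal complement of the efficiency gain. A quick editorial check of that arithmetic:

```python
# Editorial consistency check, not a figure from the paper: with E = P * t,
# an energy-efficiency gain of r at identical runtime implies a power
# reduction of 1 - 1/r.
for r in (1.77, 1.90):
    print(f"{r:.2f}x energy -> {100 * (1 - 1 / r):.1f}% power at equal runtime")
# Prints 43.5% and 47.4%; the reported 42.0-45.9% sits nearby, the small gap
# reflecting runtimes that are comparable but not identical.
```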

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach might be adapted to other many-core or accelerator designs facing similar energy bottlenecks from control and storage overhead.
  • Future work could explore automatic compilation tools that better optimize p-graph partitioning for a wider range of applications.
  • Physical implementation would allow verification of whether the simulated savings translate to real silicon under varying workloads.
  • This highlights an opportunity to trade some flexibility in scheduling for lower power in parallel processors.

Load-bearing premise

The performance and energy model used in evaluation correctly captures the behavior of both the baseline SIMD processors and the CGRA-based design, including reconfiguration overheads and memory request merging.
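
The shape of such a counter-based model is simple even if its calibration is the hard part. A hedged sketch with invented per-event costs and counts, chosen only so the arithmetic lands near the paper's reported range:

```python
# Counter-times-cost energy model (every number here is an invented
# placeholder, not a value from the paper or from the simulator):
# dynamic energy = sum over components of event_count * energy_per_event.
PJ = 1e-12
energy_per_event = {
    "rf_access":  5.0 * PJ,
    "alu_op":     1.0 * PJ,
    "ctrl_op":    2.0 * PJ,    # fetch/decode/scheduling overhead
    "mem_req":   10.0 * PJ,
    "reconfig":  50.0 * PJ,    # CGRA bitstream swap; the baseline has none
}

def dynamic_energy(counters):
    return sum(n * energy_per_event[k] for k, n in counters.items())

baseline = {"rf_access": 1_000_000, "alu_op": 400_000,
            "ctrl_op": 400_000, "mem_req": 300_000}
dice = {"rf_access": 320_000,      # mirrors the reported 68% RF reduction
        "alu_op": 400_000, "ctrl_op": 50_000,
        "mem_req": 300_000, "reconfig": 2_000}

print(f"{dynamic_energy(baseline) / dynamic_energy(dice):.2f}x")   # 1.77x
```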

What would settle it

Running the same set of parallel benchmarks on actual hardware built with both the conventional design and the new CGRA approach, using identical computation and memory resources, would settle the matter: the claims are falsified if the energy efficiency gain falls well below 1.5 times or if performance drops significantly.

Figures

Figures reproduced from arXiv: 2605.05496 by Ang Da Lu, Ang Li, Jiayi Wang, Zhichen Zeng.

Figure 1
Figure 1. Figure 1: GPGPU SIMD lockstep execution model vs. DICE CGRA thread view at source ↗
Figure 3
Figure 3. Figure 3: Control-data flow graph example (LD: Load, BB: Basic block) view at source ↗
Figure 4
Figure 4. Figure 4: p-graphs view at source ↗
Figure 5
Figure 5. Figure 5: DICE software flow and hardware execution stages view at source ↗
Figure 6
Figure 6. Figure 6: DICE top-level organization. CGRA Processors (CPs) are grouped view at source ↗
Figure 2
Figure 2. Figure 2: Operands are read from the register file into the view at source ↗
Figure 7
Figure 7. Figure 7: CGRA Processor (CP) microarchitecture view at source ↗
Figure 8
Figure 8. Figure 8: CGRA Processor (CP) pipeline execution example (M: metadata, B: CGRA bitstream, pg: view at source ↗
Figure 9
Figure 9. Figure 9: Normalized RF accesses (%) (DICE vs. RTX2060S) view at source ↗
Figure 10
Figure 10. Figure 10: Speedup (×) (DICE variants vs. RTX2060S) view at source ↗
Figure 11
Figure 11. Figure 11: Cycle breakdown and average functional unit utilization. Top: view at source ↗
Figure 12
Figure 12. Figure 12: Energy breakdown of NN benchmark in (a) RTX2060S system (b) view at source ↗
Figure 13
Figure 13. Figure 13: Energy & power efficiency (DICE CPs vs. RTX2060S SMs) view at source ↗
Figure 16
Figure 16. Figure 16: Speedup (×) of DICE-O48, DICE-O72, and RTX6000 (vs. RTX5000). view at source ↗
Figure 17
Figure 17. Figure 17: Dynamic energy efficiency and power reduction (DICE CPs vs. view at source ↗
Figure 18
Figure 18. Figure 18: Speedup (×) and RF Accesses Ratio (%) (DICE-UO vs. RTX3070) view at source ↗
read the original abstract

While GPUs dominate massively parallel computing through the single-instruction, multiple-thread (SIMT) programming model, their underlying single-instruction, multiple-data (SIMD) execution incurs substantial energy overhead from frequent register file (RF) accesses and complex control logic. We present DICE, a novel architecture that addresses these inefficiencies by replacing the SIMD backend with minimal-overhead, statically scheduled coarse-grained reconfigurable arrays (CGRAs). Unlike SIMD units that execute warps of threads in lockstep, DICE dispatches active threads in a pipelined manner onto the CGRA fabric, where data flow directly between processing elements (PEs), reducing RF accesses for intermediate values. To handle operations with runtime dynamism, such as variable-latency memory loads and data-dependent control flow, while preserving static scheduling, DICE compiles programs into "p-graphs" by partitioning dynamic dependence edges across separate CGRA configurations. DICE further introduces several key optimizations: double-buffered configuration memory to hide reconfiguration latency, compile-time p-graph unrolling to enhance resource utilization, and a temporal memory coalescing unit (TMCU) to merge memory requests from consecutive, pipelined threads. Evaluations on Rodinia benchmarks in Accel-sim demonstrate that DICE reduces register file accesses by 68% on average. With equivalent computation and memory resources, DICE's CGRA Processors (CPs) achieve a geometric mean of 1.77-1.90x dynamic energy efficiency and 42.0%-45.9% average power reduction compared to the modeled NVIDIA Turing Streaming Multiprocessors (SMs), while the full DICE system achieves performance comparable to the modeled Turing GPU baselines. DICE demonstrates that spatial pipeline execution can deliver substantial energy savings without sacrificing performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes DICE, which replaces SIMD execution in GPUs with statically scheduled CGRAs for SIMT workloads. It introduces p-graphs to partition dynamic dependence edges (e.g., from variable-latency loads and control flow) across configurations while preserving static scheduling, plus optimizations including double-buffered configuration memory, compile-time p-graph unrolling, and a temporal memory coalescing unit (TMCU). Cycle-accurate Accel-sim evaluations on Rodinia benchmarks report 68% average RF access reduction; with equivalent resources, DICE CGRA processors achieve 1.77-1.90x dynamic energy efficiency and 42.0-45.9% power reduction versus modeled NVIDIA Turing SMs, while the full system matches baseline performance.

Significance. If the Accel-sim results hold, DICE offers a concrete path to spatial dataflow execution for general-purpose SIMT codes, with substantial RF and power savings that could inform future GPU microarchitectures in energy-constrained settings. The evaluation uses external Rodinia benchmarks and a third-party simulator without fitted parameters or self-referential predictions, which strengthens credibility. However, the absence of hardware measurements or independent model validation limits immediate adoption.

major comments (2)
  1. §5 (Evaluation): The central quantitative claims (1.77-1.90x energy efficiency, 42.0-45.9% power reduction, 68% RF reduction) are derived solely from Accel-sim runs that extend the simulator to model p-graphs, double-buffered reconfiguration, and the TMCU. No description is given of how energy models for these novel components were constructed, no cross-validation against McPAT/CACTI or alternative simulators is reported, and no hardware prototype exists. This directly affects the reported gains and must be addressed with additional modeling details or sensitivity analysis.
  2. §4 (Architecture/Design, TMCU and p-graph sections): The paper states that the TMCU merges requests from pipelined threads and that p-graph partitioning handles dynamism without sacrificing static scheduling, yet no isolated measurements quantify reconfiguration overhead, coalescing latency, or the fraction of the 68% RF reduction attributable to dataflow versus TMCU. These are load-bearing for the claim that spatial pipelines deliver savings without performance loss.
minor comments (2)
  1. Abstract: The two values in the 1.77-1.90x range are not explained (e.g., different resource configurations or benchmark subsets).
  2. Throughout: Ensure first-use definitions for all acronyms (CGRA, SIMT, TMCU, p-graph) and consistent notation for energy versus power metrics.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and will revise the paper to incorporate additional modeling details, experiments, and clarifications as outlined.

read point-by-point responses
  1. Referee: §5 (Evaluation): The central quantitative claims (1.77-1.90x energy efficiency, 42.0-45.9% power reduction, 68% RF reduction) are derived solely from Accel-sim runs that extend the simulator to model p-graphs, double-buffered reconfiguration, and the TMCU. No description is given of how energy models for these novel components were constructed, no cross-validation against McPAT/CACTI or alternative simulators is reported, and no hardware prototype exists. This directly affects the reported gains and must be addressed with additional modeling details or sensitivity analysis.

    Authors: We agree that expanded details on energy modeling are needed to fully support the reported gains. In the revised manuscript, we will add a dedicated subsection in §5 describing how the energy models for p-graphs, double-buffered configuration memory, and the TMCU were constructed as extensions to Accel-sim's existing power models (calibrated to NVIDIA Turing). We will also include sensitivity analysis on key parameters such as reconfiguration energy and memory access costs to demonstrate robustness of the 1.77-1.90x energy efficiency and 42.0-45.9% power reduction results. As this is a simulation-based architectural study using a third-party cycle-accurate simulator and external Rodinia benchmarks, a hardware prototype is beyond the current scope; however, the use of unmodified external benchmarks and a public simulator strengthens credibility of the evaluation. revision: yes

  2. Referee: §4 (Architecture/Design, TMCU and p-graph sections): The paper states that the TMCU merges requests from pipelined threads and that p-graph partitioning handles dynamism without sacrificing static scheduling, yet no isolated measurements quantify reconfiguration overhead, coalescing latency, or the fraction of the 68% RF reduction attributable to dataflow versus TMCU. These are load-bearing for the claim that spatial pipelines deliver savings without performance loss.

    Authors: We concur that isolating component contributions would strengthen the claims regarding spatial pipeline benefits. In the revision, we will augment §5 with new experiments that: (1) measure and report reconfiguration overhead (showing it is hidden by double-buffering), (2) quantify TMCU coalescing latency, and (3) present an ablation study decomposing the 68% average RF access reduction into portions attributable to CGRA dataflow versus TMCU coalescing. These additions will clarify that the primary savings stem from reduced register file accesses due to direct dataflow while maintaining performance parity with the baseline. revision: yes
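
Both promised experiments share a skeleton. A hedged sketch of each, reusing the invented placeholder numbers from the energy-model sketch earlier on this page (nothing below comes from the paper): a sweep over assumed reconfiguration energy for point 1, and a feature-toggle decomposition of the RF reduction for point 2.

```python
# (1) Sensitivity sweep: vary the assumed per-reconfiguration energy and
# recompute the efficiency ratio; a robust claim should degrade gracefully.
PJ = 1e-12
BASELINE_E = 9.2e6 * PJ        # placeholder baseline dynamic energy
DICE_E_FIXED = 5.1e6 * PJ      # placeholder DICE energy sans reconfiguration
N_RECONFIG = 2_000             # assumed e-block reconfiguration count

for e_cfg_pj in (10, 50, 100, 250, 500):
    ratio = BASELINE_E / (DICE_E_FIXED + N_RECONFIG * e_cfg_pj * PJ)
    print(f"e_reconfig = {e_cfg_pj:3d} pJ -> {ratio:.2f}x")
# 1.80x, 1.77x, 1.74x, 1.64x, 1.51x: even a 10x costlier reconfiguration
# stays above the 1.5x falsification line suggested above.

# (2) Ablation: toggle each mechanism and attribute the RF-access reduction.
RF_BASE = 1_000_000            # hypothetical RF accesses on the SIMD baseline
for config, accesses in [("dataflow only", 480_000),
                         ("TMCU only", 820_000),
                         ("dataflow + TMCU", 320_000)]:
    print(f"{config:>15}: {100 * (1 - accesses / RF_BASE):.0f}% fewer RF accesses")
```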

standing simulated objections not resolved
  • Absence of hardware prototype or silicon measurements to validate the Accel-sim energy models and reported gains.

Circularity Check

0 steps flagged

No circularity; results are empirical simulation outputs on external benchmarks.

full rationale

The paper presents an architecture (p-graphs, double-buffered reconfiguration, TMCU) and reports quantitative claims exclusively from Accel-sim runs on Rodinia benchmarks. No equations, fitted parameters, or predictions are defined in terms of the target metrics; the energy-efficiency and power numbers are direct simulation outputs rather than reductions by construction. No self-citations are invoked to justify uniqueness or load-bearing premises, and the evaluation uses a third-party simulator plus standard external workloads. This is a conventional empirical comparison with no self-referential steps in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on the fidelity of the Accel-sim energy model and on the effectiveness of two newly introduced mechanisms (p-graphs and TMCU) whose overheads are only characterized in simulation.

axioms (1)
  • domain assumption The Accel-sim simulator accurately captures dynamic energy, power, and performance for both the baseline Turing SM and the proposed CGRA-based design.
    All reported speedups and energy ratios depend on this modeling assumption.
invented entities (2)
  • p-graphs no independent evidence
    purpose: Partition dynamic dependence edges across separate statically scheduled CGRA configurations.
    New compile-time abstraction introduced to handle variable-latency memory and data-dependent control flow.
  • Temporal Memory Coalescing Unit (TMCU) no independent evidence
    purpose: Merge memory requests from consecutive pipelined threads.
    New hardware unit proposed to improve memory bandwidth utilization under the pipelined dispatch model.

pith-pipeline@v0.9.0 · 5632 in / 1517 out tokens · 83478 ms · 2026-05-08T15:25:00.528441+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 12 canonical work pages

  1. [1]

T. M. Aamodt, W. W. L. Fung, and T. G. Rogers, General-Purpose Graphics Processor Architectures, 1st ed., ser. Synthesis Lectures on Computer Architecture. Cham: Springer, 2018

  2. [2]

    Benchmark-driven models for energy analysis and attribution of GPU-accelerated supercomputing,

O. Antepara, Z. Zhao, B. Austin, N. Ding, L. Oliker, N. J. Wright, and S. Williams, “Benchmark-driven models for energy analysis and attribution of GPU-accelerated supercomputing,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’25. New York, NY, USA: Association for Computing Machi...

  3. [3]

    Executing a program on the MIT tagged-token dataflow architecture,

Arvind and R. Nikhil, “Executing a program on the MIT tagged-token dataflow architecture,” IEEE Transactions on Computers, vol. 39, no. 3, pp. 300–318, 1990

  4. [4]

    Design of a fused multiply-add floating-point and integer datapath,

T. M. Bruintjes, “Design of a fused multiply-add floating-point and integer datapath,” Master’s thesis, University of Twente, Enschede, The Netherlands, 2011. [Online]. Available: https://essay.utwente.nl/fileshare/file/61055/MSc TM Bruintjes CAES ASCI.pdf#page=121.00

  5. [5]

    GeForce GTX 1660 Ti’s Advanced Shaders Accelerate Performance In The Latest Games,

A. Burnes, “GeForce GTX 1660 Ti’s Advanced Shaders Accelerate Performance In The Latest Games,” NVIDIA GeForce News, Feb. 2019, accessed March 28, 2026. [Online]. Available: https://www.nvidia.com/en-us/geforce/news/geforce-gtx-1660-ti-advanced-shaders-streaming-multiprocessor/

  6. [6]

Joules RTL Power Solution,

Cadence Design Systems, Inc., 2025. [Online]. Available: https://www.cadence.com/en_US/home/tools/digital-design-and-signoff/power-analysis/joules-rtl-power-solution.html

  7. [7]

    Rodinia: A benchmark suite for heterogeneous computing,

S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in 2009 IEEE International Symposium on Workload Characterization (IISWC), 2009, pp. 44–54

  8. [8]

    CGRA-ME: A unified framework for CGRA modelling and exploration,

S. A. Chin, N. Sakamoto, A. Rui, J. Zhao, J. H. Kim, Y. Hara-Azumi, and J. Anderson, “CGRA-ME: A unified framework for CGRA modelling and exploration,” in 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2017, pp. 184–189

  9. [9]

    Amber: A 16-nm system-on-chip with a coarse-grained reconfigurable array for flexible acceleration of dense linear algebra,

K. Feng, T. Kong, K. Koul, J. Melchert, A. Carsello, Q. Liu, G. Nyengele, M. Strange, K. Zhang, A. Nayak, J. Setter, J. Thomas, K. Sreedhar, P.-H. Chen, N. Bhagdikar, Z. A. Myers, B. D’Agostino, P. Joshi, S. Richardson, C. Torng, M. Horowitz, and P. Raina, “Amber: A 16-nm system-on-chip with a coarse-grained reconfigurable array for flexible acceleratio...

  10. [11]

    Nvidia 12nm Turing TU116 GeForce GTX 1660 Ti Die Shot,

——, “Nvidia 12nm Turing TU116 GeForce GTX 1660 Ti Die Shot,” Flickr, 2019, CC0 1.0 / public domain. [Online]. Available: https://www.flickr.com/photos/130561288@N04/47220396142

  11. [12]

    Dynamic warp formation and scheduling for efficient GPU control flow,

W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt, “Dynamic warp formation and scheduling for efficient GPU control flow,” in 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), 2007, pp. 407–420

  12. [13]

    SNAFU: An ultra-low-power, energy-minimal CGRA-generation framework and architecture,

G. Gobieski, A. O. Atli, K. Mai, B. Lucia, and N. Beckmann, “SNAFU: An ultra-low-power, energy-minimal CGRA-generation framework and architecture,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 1027–1040

  13. [14]

    RipTide: A programmable, energy-minimal dataflow compiler and architecture,

G. Gobieski, S. Ghosh, M. Heule, T. Mowry, T. Nowatzki, N. Beckmann, and B. Lucia, “RipTide: A programmable, energy-minimal dataflow compiler and architecture,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022, pp. 546–564

  14. [15]

    DySER: Unifying functionality and parallelism specialization for energy-efficient computing,

V. Govindaraju, C.-H. Ho, T. Nowatzki, J. Chhugani, N. Satish, K. Sankaralingam, and C. Kim, “DySER: Unifying functionality and parallelism specialization for energy-efficient computing,” IEEE Micro, vol. 32, no. 5, pp. 38–51, 2012

  15. [16]

    Understanding sources of inefficiency in general-purpose chips,

R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee, S. Richardson, C. Kozyrakis, and M. Horowitz, “Understanding sources of inefficiency in general-purpose chips,” in Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA ’10), 2010, pp. 37–47. [Online]. Available: https://doi.org/10.1145/1815961.1815968

  16. [17]

Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation,

S. Hauck and A. DeHon, Eds., Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation, ser. Systems on Silicon Ser. San Francisco: Morgan Kaufmann, 2007

  17. [18]

    An integrated GPU power and performance model,

S. Hong and H. Kim, “An integrated GPU power and performance model,” in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA ’10. New York, NY, USA: Association for Computing Machinery, 2010, p. 280–289. [Online]. Available: https://doi.org/10.1145/1815961.1815998

  18. [19]

    AccelWattch: A power modeling framework for modern GPUs,

V. Kandiah, S. Peverelle, M. Khairy, J. Pan, A. Manjunath, T. G. Rogers, T. M. Aamodt, and N. Hardavellas, “AccelWattch: A power modeling framework for modern GPUs,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’21. New York, NY, USA: Association for Computing Machinery, 2021, p. 738–753. [Online]. Available:...

  19. [20]

    HyCUBE: A CGRA with reconfigurable single-cycle multi-hop interconnect,

M. Karunaratne, A. K. Mohite, T. Mitra, and L.-S. Peh, “HyCUBE: A CGRA with reconfigurable single-cycle multi-hop interconnect,” in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), 2017, pp. 1–6

  20. [21]

    Accel-Sim: An extensible simulation framework for validated GPU modeling,

    M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, “Accel-Sim: An extensible simulation framework for validated GPU modeling,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020, pp. 473–486

  21. [22]

    Moonwalk: NRE optimization in ASIC clouds,

M. Khazraee, L. Zhang, L. Vega, and M. B. Taylor, “Moonwalk: NRE optimization in ASIC clouds,” in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’17. New York, NY, USA: Association for Computing Machinery, 2017, p. 511–526. [Online]. Available: https://doi.org...

  22. [23]

    AHA: An agile approach to the design of coarse-grained reconfigurable accelerators and compilers,

K. Koul, J. Melchert, K. Sreedhar, L. Truong, G. Nyengele, K. Zhang, Q. Liu, J. Setter, P.-H. Chen, Y. Mei, M. Strange, R. Daly, C. Donovick, A. Carsello, T. Kong, K. Feng, D. Huff, A. Nayak, R. Setaluri, J. Thomas, N. Bhagdikar, D. Durst, Z. Myers, N. Tsiskaridze, S. Richardson, R. Bahr, K. Fatahalian, P. Hanrahan, C. Barrett, M. Horowitz, C. Torng, F. ...

  23. [24]

    Available: https://doi.org/10.1145/3534933

    [Online]. Available: https://doi.org/10.1145/3534933

  24. [25]

    Lockup-free instruction fetch/prefetch cache organization,

    D. Kroft, “Lockup-free instruction fetch/prefetch cache organization,” in Proceedings of the 8th Annual Symposium on Computer Architecture, ser. ISCA ’81. Washington, DC, USA: IEEE Computer Society Press, 1981, p. 81–87

  25. [26]

    GPUWattch: enabling energy optimizations in GPGPUs,

J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, “GPUWattch: enabling energy optimizations in GPGPUs,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ser. ISCA ’13. New York, NY, USA: Association for Computing Machinery, 2013, p. 487–498. [Online]. Available: https://doi.org...

  26. [27]

    NVIDIA Tesla: A unified graphics and computing architecture,

E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, “NVIDIA Tesla: A unified graphics and computing architecture,” IEEE Micro, vol. 28, no. 2, pp. 39–55, 2008

  27. [28]

    A Survey of Coarse-Grained Reconfigurable Architecture and Design: Taxonomy, Challenges, and Applications

L. Liu, J. Zhu, Z. Li, Y. Lu, Y. Deng, J. Han, S. Yin, and S. Wei, “A survey of coarse-grained reconfigurable architecture and design: Taxonomy, challenges, and applications,” ACM Comput. Surv., vol. 52, no. 6, Oct. 2019. [Online]. Available: https://doi.org/10.1145/3357375

  28. [29]

    ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix,

B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix,” in Field Programmable Logic and Application, P. Y. K. Cheung and G. A. Constantinides, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 61–70

  29. [30]

    Automated design space exploration of CGRA processing element architectures using frequent subgraph analysis,

    J. Melchert, K. Feng, C. Donovick, R. Daly, C. Barrett, M. Horowitz, P. Hanrahan, and P. Raina, “Automated design space exploration of CGRA processing element architectures using frequent subgraph analysis,” 2021. [Online]. Available: https://arxiv.org/abs/2104.14155

  30. [31]

    Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0,

N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0,” in 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), 2007, pp. 3–14

  31. [32]

    Scalable parallel programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for?

J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for?” Queue, vol. 6, no. 2, pp. 40–53, Mar. 2008. [Online]. Available: https://doi.org/10.1145/1365490.1365500

  32. [33]

    Stream-dataflow acceleration,

T. Nowatzki, V. Gangadhar, N. Ardalani, and K. Sankaralingam, “Stream-dataflow acceleration,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017, pp. 416–429

  33. [34]

    NVIDIA Tesla P100,

NVIDIA Corporation, “NVIDIA Tesla P100,” https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf, 2016, white paper

  34. [35]

    NVIDIA TESLA V100 GPU ARCHITECTURE,

——, “NVIDIA TESLA V100 GPU ARCHITECTURE,” https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf, 2017, white paper

  35. [36]

    NVIDIA TURING GPU ARCHITECTURE,

——, “NVIDIA TURING GPU ARCHITECTURE,” NVIDIA, Santa Clara, CA, Tech. Rep., 2018, https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

  36. [37]

    NVIDIA A100 Tensor Core GPU Architecture,

——, “NVIDIA A100 Tensor Core GPU Architecture,” https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf, 2020, white paper

  37. [38]

    NVIDIA Ampere GA102 GPU Architecture,

——, “NVIDIA Ampere GA102 GPU Architecture,” https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf, 2020, white paper

  38. [39]

    NVIDIA H100 Tensor Core GPU Architecture,

——, “NVIDIA H100 Tensor Core GPU Architecture,” https://resources.nvidia.com/en-us-hopper-architecture/nvidia-h100-tensor-c, 2022, white paper

  39. [40]

——, Parallel Thread Execution (PTX) ISA Version 8.8, 2024, https://docs.nvidia.com/cuda/parallel-thread-execution/

  40. [41]

    CUDA Toolkit Documentation: NVCC, the CUDA Compiler Driver,

——, “CUDA Toolkit Documentation: NVCC, the CUDA Compiler Driver,” https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html, June 2025, accessed: 2025-07-30

  41. [42]

    Swizzle Inventor: Data movement synthesis for GPU kernels,

P. M. Phothilimthana, A. S. Elliott, A. Wang, A. Jangda, B. Hagedorn, H. Barthels, S. J. Kaufman, V. Grover, E. Torlak, and R. Bodik, “Swizzle Inventor: Data movement synthesis for GPU kernels,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’19. New York, ...

  42. [43]

    A survey on coarse-grained reconfigurable architectures from a performance perspective,

A. Podobas, K. Sano, and S. Matsuoka, “A survey on coarse-grained reconfigurable architectures from a performance perspective,” IEEE Access, vol. 8, pp. 146719–146743, 2020

  43. [44]

    Plasticine: A reconfigurable architecture for parallel patterns,

R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, A. Pedram, C. Kozyrakis, and K. Olukotun, “Plasticine: A reconfigurable architecture for parallel patterns,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017, pp. 389–402

  44. [45]

    SCALE-Sim V3: a modular cycle-accurate systolic accelerator simulator for end-to-end system analysis,

R. Raj, S. Banerjee, N. Chandra, Z. Wan, J. Tong, A. Samajdhar, and T. Krishna, “SCALE-Sim V3: a modular cycle-accurate systolic accelerator simulator for end-to-end system analysis,” in 2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2025, pp. 186–200

  45. [46]

    Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture,

K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. Keckler, and C. Moore, “Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture,” in 30th Annual International Symposium on Computer Architecture, 2003. Proceedings., 2003, pp. 422–433

  46. [47]

    Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm,

A. Stillmaker and B. Baas, “Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm,” Integration, vol. 58, pp. 74–81, 2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167926017300755

  47. [48]

    FreePDK: An open-source variation-aware design kit,

J. E. Stine, I. Castellanos, M. Wood, J. Henson, F. Love, W. R. Davis, P. D. Franzon, M. Bucher, S. Basavarajaiah, J. Oh, and R. Jenkal, “FreePDK: An open-source variation-aware design kit,” in 2007 IEEE International Conference on Microelectronic Systems Education (MSE’07), 2007, pp. 173–174

  48. [49]

    WaveScalar,

S. Swanson, K. Michelson, A. Schwerin, and M. Oskin, “WaveScalar,” in Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36., 2003, pp. 291–302

  49. [50]

VecPAC: A vectorizable and precision-aware CGRA,

C. Tan, D. Patil, A. Tumeo, G. Weisz, S. Reinhardt, and J. Zhang, “VecPAC: A vectorizable and precision-aware CGRA,” in 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2023, pp. 1–9

  50. [51]

    OpenCGRA: An open-source unified framework for modeling, testing, and evaluating CGRAs,

C. Tan, C. Xie, A. Li, K. J. Barker, and A. Tumeo, “OpenCGRA: An open-source unified framework for modeling, testing, and evaluating CGRAs,” in 2020 IEEE 38th International Conference on Computer Design (ICCD), 2020, pp. 381–388

  51. [52]

The end of Moore’s law: A new beginning for information technology,

T. N. Theis and H.-S. P. Wong, “The end of Moore’s law: A new beginning for information technology,” Computing in Science & Engineering, vol. 19, no. 2, pp. 41–50, 2017

  52. [53]

    Single-graph multiple flows: energy efficient design alternative for GPGPUs,

D. Voitsechov and Y. Etsion, “Single-graph multiple flows: energy efficient design alternative for GPGPUs,” in Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA ’14), 2014, pp. 205–216

  53. [54]

    Control flow coalescing on a hybrid dataflow/von Neumann GPGPU,

——, “Control flow coalescing on a hybrid dataflow/von Neumann GPGPU,” in 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2015, pp. 216–227

  54. [55]

    Inter-Thread Communication in Multithreaded, Reconfigurable Coarse-Grain Arrays,

D. Voitsechov, O. Port, and Y. Etsion, “Inter-Thread Communication in Multithreaded, Reconfigurable Coarse-Grain Arrays,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018, pp. 42–54

  55. [56]

    A prototype data flow computer with token labelling,

I. Watson and J. Gurd, “A prototype data flow computer with token labelling,” in 1979 International Workshop on Managing Requirements Knowledge (MARK), 1979, pp. 623–628

  56. [57]

A hybrid systolic-dataflow architecture for inductive matrix algorithms,

J. Weng, S. Liu, Z. Wang, V. Dadu, and T. Nowatzki, “A hybrid systolic-dataflow architecture for inductive matrix algorithms,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 703–716