
arxiv: 2606.05511 · v1 · pith:ZU673SBVnew · submitted 2026-06-03 · 💻 cs.ET

RH+: Row-Hit-Optimized Scheduling for PIM-based LLM Inference

Pith reviewed 2026-06-28 02:17 UTC · model grok-4.3

classification 💻 cs.ET
keywords PIM · LLM inference · GEMV · DRAM scheduling · row-hit optimization · HBM3 · autoregressive decoding · energy efficiency

The pith

For PIM LLM inference, a stride change that keeps 32 MAC operations in the same DRAM row delivers 8-12x speedup by amortizing the row cycle time over those 32 row hits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that GEMV operations in autoregressive LLM decoding on PIM architectures are limited by DRAM row cycle time nRC, which is 10 to 11 times larger than the nCCDAB parameter targeted by prior work. This occurs because standard host-centric address interleaving spreads each all-bank MAC command across different rows, rendering nCCDAB entirely hidden. The authors introduce RH+ scheduling as a minimal change to the access stride that confines 32 consecutive MAC operations to one row. Cycle-accurate simulations on four LLM workloads confirm the resulting gains in performance and efficiency.

Core claim

In GEMV operations that dominate autoregressive decoding, nRC is 10 to 11 times larger than nCCDAB, so the latter is masked and prior nCCDAB-focused optimizations are ineffective; the root cause is host-centric interleaving that forces every all-bank MAC into a different row. RH+ scheduling uses a simple stride change to keep 32 consecutive MAC operations within the same row, yielding 8-12x speedup, over 74% energy reduction, and up to 52x EDP improvement in cycle-accurate simulation across four LLM workloads.
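The masking argument can be illustrated with a first-order cycle model. This is a sketch under stated assumptions, not the paper's simulator: the cycle counts `N_CCDAB = 4` and `N_RC = 42` are hypothetical placeholders chosen only so that their ratio (10.5) falls in the quoted 10 to 11x range, and the function `gemv_cycles` is an illustrative name.

```python
import math

# Hypothetical cycle counts, chosen only so that nRC / nCCDAB = 10.5,
# inside the 10-11x range quoted above. They are not JEDEC values.
N_CCDAB = 4
N_RC = 42

def gemv_cycles(num_macs: int, macs_per_row: int,
                n_rc: int = N_RC, n_ccdab: int = N_CCDAB) -> int:
    """First-order cycle count for a stream of all-bank MAC commands.

    Each row activation is charged max(nRC, macs_per_row * nCCDAB):
    a row cannot be cycled faster than nRC, and the MACs issued to it
    cannot be spaced closer than nCCDAB.
    """
    activations = math.ceil(num_macs / macs_per_row)
    return activations * max(n_rc, macs_per_row * n_ccdab)

baseline = gemv_cycles(1024, macs_per_row=1)   # every MAC opens a new row
rh_plus = gemv_cycles(1024, macs_per_row=32)   # 32 row hits per activation
print(baseline, rh_plus, baseline / rh_plus)
```

In this toy model the baseline cost depends only on nRC, so shrinking nCCDAB changes nothing, which is the masking effect. Once 32 MACs share a row, nCCDAB becomes the binding term and the modeled speedup equals the nRC/nCCDAB ratio.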

What carries the argument

RH+ scheduling, a stride adjustment in address mapping that confines consecutive MAC operations to the same DRAM row instead of spreading them across rows.
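A toy address mapping makes the stride effect concrete. The stride values (64 columns for the baseline, 1 column for RH+) come from the paper excerpt reproduced with Figure 1; the row width `COLS_PER_ROW = 32` is an assumption inferred from the "32 consecutive row hits" figure, and the flat linear layout and the function names are hypothetical simplifications that ignore channel and bank interleaving.

```python
COLS_PER_ROW = 32  # assumption: one row holds 32 MAC-sized columns

def mac_address(mac_index: int, stride: int) -> tuple[int, int]:
    """Map the i-th all-bank MAC to a (row, column) pair in a flat layout."""
    linear = mac_index * stride
    return linear // COLS_PER_ROW, linear % COLS_PER_ROW

def rows_touched(num_macs: int, stride: int) -> int:
    """Number of distinct rows visited by the first num_macs commands."""
    return len({mac_address(i, stride)[0] for i in range(num_macs)})

print([mac_address(i, 64) for i in range(3)])  # baseline: rows 0, 2, 4
print([mac_address(i, 1) for i in range(3)])   # RH+: row 0, columns 0, 1, 2
print(rows_touched(32, 64), rows_touched(32, 1))
```

Under these assumptions, 32 consecutive MACs touch 32 different rows at stride 64 and a single row at stride 1.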

If this is right

  • nCCDAB-targeted scheduling optimizations provide no benefit for GEMV-dominated autoregressive decoding on PIM.
  • Row cycle time becomes the dominant constraint that must be addressed through address mapping or scheduling.
  • RH+ achieves 8-12x speedup, over 74% energy reduction, and up to 52x EDP improvement without hardware modifications.
  • The performance gap between prior methods and RH+ widens as model size increases because GEMV remains the bottleneck.
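The relation between the three headline metrics follows from the definition EDP = energy x delay. The sketch below shows the arithmetic only; the input pairs are illustrative, because the summary does not say which speedup is paired with which energy reduction, and `edp_improvement` is a hypothetical helper name.

```python
def edp_improvement(speedup: float, energy_reduction: float) -> float:
    """Return EDP_baseline / EDP_new, where EDP = energy * delay.

    speedup          -- baseline delay divided by new delay
    energy_reduction -- fraction of baseline energy saved (0.75 means 75%)
    """
    remaining_energy = 1.0 - energy_reduction
    return speedup / remaining_energy

# Illustrative inputs, not measured pairs from the paper.
print(edp_improvement(8.0, 0.75))
print(edp_improvement(12.0, 0.75))
```

Because the two factors multiply, an energy reduction somewhat above 74% combined with a speedup near the top of the 8-12x range is enough to reach an EDP improvement of several tens.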

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same row-hit principle could be applied to other memory-bound workloads whose access patterns resemble GEMV.
  • Future DRAM address mappings for AI accelerators might be designed around row locality rather than bank parallelism from the start.
  • Combining RH+ with software-level tiling or model compression could produce additional gains beyond the reported numbers.
  • Real hardware validation would need to check whether the new stride pattern introduces unexpected refresh or power-delivery side effects not captured in simulation.

Load-bearing premise

The cycle-accurate simulator faithfully reproduces real HBM3 DRAM timing behavior, including any effects of the altered interleaving on power delivery or thermal limits.

What would settle it

Running the same four LLM workloads on physical HBM3-based PIM hardware with the RH+ stride change versus the baseline address mapping and measuring whether the observed speedup falls outside the simulated 8-12x range.

Figures

Figures reproduced from arXiv: 2606.05511 by Byeong Kil Lee, Jeeho Ryoo, Shafayat Mowla Anik, Yongchan Jung.

Figure 1: HBM3-PIM bank architecture and nRC vs. nCCDAB across five HBM3 speed grades [4].
… address interleaving stride that forces every MAC AB into a different DRAM row. We propose RH+ scheduling, which changes the MAC AB address stride from 64 columns to 1 column, enabling 32 consecutive row hits per activate-precharge cycle. This paper makes three contributions: • We identify nRC, not nCCDAB, as the true GEMV bot…
Figure 2: Baseline stride-64 row addressing. Each MAC …
Figure 3: RH+ mechanism (top) and timing-level speedup (bottom) at the …
Figure 5: End-to-end energy reduction (%) of RH+.
… 175B [1] (96 layers, 96 heads, d_model = 12288), LLaMA-65B [12] (80 layers, 64 heads, d_model = 8192), Megatron-Turing 530B [11] (105 layers, 128 heads, d_model = 20480), and OPT-66B [13] (64 layers, 72 heads, d_model = 9216). Each model is evaluated with three input/output sequence length pairs (128/2048, 2048/128, 2048/2048) and two batch sizes (1 and 4), yielding 24 c…
Figure 6: End-to-end EDP improvement of RH+.
[Plot residue, apparently from the energy-breakdown chart: axis "Energy Breakdown (lower better)", 0 to 100; bars for AttAcc and RH+ on GPT-175B, LLAMA-65B, MT-530B, OPT-66B; components Background, ACT/PRE/REF, Compute, Rearrangement.]
Figure 7: Energy breakdown normalized to baseline.
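The evaluation grid described in the excerpt captured with Figure 5 (four models, three input/output length pairs, two batch sizes) can be enumerated as a quick check of the count. The model labels are the ones that appear in the energy-breakdown plot legend; the tuple layout is an illustrative choice, not the paper's configuration format.

```python
from itertools import product

MODELS = ["GPT-175B", "LLaMA-65B", "MT-530B", "OPT-66B"]
SEQ_PAIRS = [(128, 2048), (2048, 128), (2048, 2048)]  # (input, output) lengths
BATCH_SIZES = [1, 4]

# Each configuration is (model, (input_len, output_len), batch_size).
configs = list(product(MODELS, SEQ_PAIRS, BATCH_SIZES))
print(len(configs))
print(configs[0])
```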
Original abstract

Large language model inference on processing-in-memory (PIM) architectures promises to break the memory wall by performing multiply-accumulate (MAC) operations directly within HBM3 DRAM banks. Prior work identifies the power constraint timing parameter nCCDAB as the primary performance bottleneck and optimizes scheduling accordingly. We demonstrate that for GEMV operations that dominate autoregressive decoding, the DRAM row cycle time (nRC) is 10 to 11 times larger than nCCDAB. Consequently, nCCDAB is entirely masked, rendering prior nCCDAB-focused optimizations ineffective for these workloads. The root cause is inherited host-centric address interleaving, which forces every all-bank MAC command into a different DRAM row. We propose RH+ scheduling, a simple stride change that keeps 32 consecutive MAC operations within the same row. Cycle-accurate simulation across four LLM workloads shows that RH+ delivers 8-12x speedup, over 74% energy reduction, and up to 52x EDP improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that GEMV operations dominating autoregressive LLM decoding on PIM architectures with HBM3 exhibit nRC 10-11x larger than nCCDAB due to host-centric address interleaving that forces each all-bank MAC into a different row; prior nCCDAB-focused optimizations are thus ineffective. It proposes RH+ as a simple stride change to keep 32 consecutive MACs in the same row and reports 8-12x speedup, >74% energy reduction, and up to 52x EDP improvement from cycle-accurate simulation on four LLM workloads.

Significance. If the simulation results hold under real HBM3 behavior with the modified interleaving, the work would be significant for PIM-based LLM inference by shifting focus from nCCDAB to row-hit optimization and demonstrating large, practical gains from a minimal scheduling change. The quantitative speedups, energy, and EDP numbers would directly inform PIM scheduler design for memory-bound workloads.

major comments (1)
  1. [Abstract] Abstract (and implied simulation methodology): the central claims that nRC is 10-11x nCCDAB (masking prior optimizations) and that RH+ yields 8-12x speedup rest entirely on cycle-accurate simulation fidelity for the new address mapping. No real-silicon validation, error bars, or sensitivity analysis to unmodeled effects (power delivery, thermal constraints, or effective nRC under the stride change) is provided; this is load-bearing because any deviation in modeled row-hit behavior would invalidate both the masking conclusion and the reported gains.
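One form the requested sensitivity analysis could take is a sweep over how many row hits are actually achieved per activation, for example if refresh or other commands interrupt a run of 32. The sketch below uses a first-order model with hypothetical cycle counts (`N_CCDAB = 4`, `N_RC = 42`, ratio 10.5); it is an illustration of the kind of check the referee asks for, not a result from the paper.

```python
# Hypothetical cycle counts; only their ratio (10.5) is tied to the text.
N_CCDAB = 4
N_RC = 42

def modeled_speedup(hits_per_activation: int,
                    n_rc: int = N_RC, n_ccdab: int = N_CCDAB) -> float:
    """Speedup over one-MAC-per-row when k MACs share each activation."""
    # Baseline: every MAC pays a full row cycle.
    baseline = hits_per_activation * max(n_rc, n_ccdab)
    # Batched: one row cycle, or k back-to-back MACs, whichever is longer.
    batched = max(n_rc, hits_per_activation * n_ccdab)
    return baseline / batched

for k in (1, 2, 4, 8, 16, 32):
    print(k, modeled_speedup(k))
```

In this model the speedup grows linearly with the number of hits until the MAC stream itself takes longer than one row cycle, after which it saturates at the nRC/nCCDAB ratio. A shortfall in achieved row hits would therefore matter mostly below that saturation point.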

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for highlighting the importance of simulation fidelity. We address the major comment below, noting that our evaluation follows standard practice for architecture proposals using cycle-accurate DRAM models.

Point-by-point responses
  1. Referee: [Abstract] Abstract (and implied simulation methodology): the central claims that nRC is 10-11x nCCDAB (masking prior optimizations) and that RH+ yields 8-12x speedup rest entirely on cycle-accurate simulation fidelity for the new address mapping. No real-silicon validation, error bars, or sensitivity analysis to unmodeled effects (power delivery, thermal constraints, or effective nRC under the stride change) is provided; this is load-bearing because any deviation in modeled row-hit behavior would invalidate both the masking conclusion and the reported gains.

    Authors: The nRC/nCCDAB ratio (10-11x) is taken directly from the JEDEC HBM3 specification and is independent of our simulator; it is a fixed timing parameter that masks nCCDAB-focused optimizations for GEMV regardless of address mapping. The RH+ speedup numbers derive from cycle-accurate simulation that models the DRAM command state machine, row activation, and the stride change in address interleaving. We model row hits by tracking per-bank row buffers under the new mapping, using publicly documented HBM3 timing values. Real-silicon validation of a modified controller interleaving is outside the scope of this simulation study, as is sensitivity analysis to secondary effects such as thermal throttling or power delivery noise. The core row-hit behavior under RH+ follows directly from the command sequence and does not alter the underlying DRAM timing parameters.
    revision: no
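The per-bank row-buffer tracking mentioned in the response can be sketched as follows. This is a minimal illustration, not the authors' simulator: the class `RowBufferTracker`, the helper `run`, and the bank count of 16 are all hypothetical, and timing is not modeled at all, only hit and miss counts.

```python
class RowBufferTracker:
    """Tracks the open row in each bank and counts hits and misses."""

    def __init__(self, num_banks: int) -> None:
        self.open_row = [None] * num_banks  # None means no row is open
        self.hits = 0
        self.misses = 0

    def all_bank_access(self, row: int) -> None:
        """Apply one all-bank command that targets `row` in every bank."""
        for bank, current in enumerate(self.open_row):
            if current == row:
                self.hits += 1
            else:
                self.misses += 1            # would need precharge + activate
                self.open_row[bank] = row

def run(rows, num_banks: int = 16):
    """Replay a sequence of target rows and return (hits, misses)."""
    tracker = RowBufferTracker(num_banks)
    for row in rows:
        tracker.all_bank_access(row)
    return tracker.hits, tracker.misses

baseline = run(2 * i for i in range(64))    # a different row on every command
rh_plus = run(i // 32 for i in range(64))   # 32 consecutive commands per row
print(baseline, rh_plus)
```

With 64 commands and 16 banks, the baseline pattern never hits an open row, while the RH+ pattern misses only when it moves to the next row.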

standing simulated objections not resolved
  • Real-silicon validation of RH+ under modified HBM3 interleaving, including sensitivity to unmodeled effects such as thermal constraints and power delivery

Circularity Check

0 steps flagged

No circularity; claims rest on external cycle-accurate simulation of standard HBM3 parameters

full rationale

The paper derives its core claims (nRC being 10-11x nCCDAB for GEMV, masking of prior optimizations, and 8-12x speedup from RH+ stride change) from cycle-accurate simulation using published HBM3 timing parameters and a proposed address mapping. No equations or results reduce to inputs by construction, no fitted parameters are relabeled as predictions, and no load-bearing self-citations or uniqueness theorems from the same authors appear. The simulation methodology is presented as independent of the target performance numbers, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard DRAM timing parameters (nRC, nCCDAB) taken from JEDEC specifications and the assumption that host-centric interleaving is the default mapping; no new free parameters or invented entities are introduced. The simulation itself is treated as an external benchmark.

axioms (1)
  • domain assumption: Standard HBM3 timing parameters (nRC = 10-11x nCCDAB) apply unchanged under the new address mapping.
    Invoked when claiming nCCDAB is masked and when reporting speedups from the stride change.

pith-pipeline@v0.9.1-grok · 5711 in / 1354 out tokens · 23509 ms · 2026-06-28T02:17:29.495923+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 3 linked inside Pith

  1. [1] T. Brown et al. Language models are few-shot learners. Advances Neural Inf. Process. Syst. (NeurIPS), 33, 2020

  2. [2] K. Chandrasekar et al. DRAMPower: Open-source DRAM power & energy estimation tool. URL: http://www.drampower.info, 2012

  3. [3] G. Hwang et al. NeuPIMs: NPU-PIM heterogeneous acceleration for batched LLM inference. In Proc. ACM Int. Conf. Archit. Support Program. Lang. Oper. Syst. (ASPLOS), 2024

  4. [4] JEDEC. JESD238A: High bandwidth memory (HBM3) DRAM, 2023

  5. [5] Y. Jeong et al. AttAcc: Unleashing the power of PIM for batched transformer-based generative model inference. In Proc. ACM Int. Conf. Archit. Support Program. Lang. Oper. Syst. (ASPLOS), 2024

  6. [6] H. Kim et al. Newton: A DRAM-maker's accelerator-in-memory (AiM) architecture for machine learning. In Proc. IEEE/ACM Int. Symp. Microarchit. (MICRO), 2020

  7. [7] S. Lee et al. AIM: Energy-efficient aggregation inside the memory hierarchy. ACM Trans. Archit. Code Optim. (TACO), 13(1), 2016

  8. [8] S. Lee et al. Hardware architecture and software stack for PIM based on commercial DRAM technology: Industrial product. In Proc. ACM/IEEE Int. Symp. Comput. Archit. (ISCA), 2021

  9. [9] S. Lee et al. A 1ynm 1.25V 8Gb 16Gb/s/pin GDDR6-based accelerator-in-memory supporting 1TFLOPS MAC operation and various activation functions. In Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2022

  10. [10] H. Luo et al. Ramulator 2.0: A modern, modular, and extensible DRAM simulator. IEEE Comput. Archit. Lett., 22(2), 2023

  11. [11] M. Shoeybi et al. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  12. [12] H. Touvron et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  13. [13] S. Zhang et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

  14. [14] C. Zhou et al. TransPIM: A memory-based acceleration via software-hardware co-design for transformers. In Proc. IEEE Int. Symp. High-Perform. Comput. Archit. (HPCA), 2022