RH+: Row-Hit-Optimized Scheduling for PIM-based LLM Inference

Byeong Kil Lee; Jeeho Ryoo; Shafayat Mowla Anik; Yongchan Jung

arxiv: 2606.05511 · v1 · pith:ZU673SBVnew · submitted 2026-06-03 · 💻 cs.ET

RH+: Row-Hit-Optimized Scheduling for PIM-based LLM Inference

Yongchan Jung , Shafayat Mowla Anik , Byeong Kil Lee , Jeeho Ryoo This is my paper

Pith reviewed 2026-06-28 02:17 UTC · model grok-4.3

classification 💻 cs.ET

keywords PIMLLM inferenceGEMVDRAM schedulingrow-hit optimizationHBM3autoregressive decodingenergy efficiency

0 comments

The pith

For PIM LLM inference, a stride change that keeps 32 MAC operations in the same DRAM row delivers 8-12x speedup by masking row cycle time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that GEMV operations in autoregressive LLM decoding on PIM architectures are limited by DRAM row cycle time nRC, which is 10 to 11 times larger than the nCCDAB parameter targeted by prior work. This occurs because standard host-centric address interleaving spreads each all-bank MAC command across different rows, rendering nCCDAB entirely hidden. The authors introduce RH+ scheduling as a minimal change to the access stride that confines 32 consecutive MAC operations to one row. Cycle-accurate simulations on four LLM workloads confirm the resulting gains in performance and efficiency.

Core claim

In GEMV operations that dominate autoregressive decoding, nRC is 10 to 11 times larger than nCCDAB, so the latter is masked and prior nCCDAB-focused optimizations are ineffective; the root cause is host-centric interleaving that forces every all-bank MAC into a different row. RH+ scheduling uses a simple stride change to keep 32 consecutive MAC operations within the same row, yielding 8-12x speedup, over 74% energy reduction, and up to 52x EDP improvement in cycle-accurate simulation across four LLM workloads.

What carries the argument

RH+ scheduling, a stride adjustment in address mapping that confines consecutive MAC operations to the same DRAM row instead of spreading them across rows.

If this is right

nCCDAB-targeted scheduling optimizations provide no benefit for GEMV-dominated autoregressive decoding on PIM.
Row cycle time becomes the dominant constraint that must be addressed through address mapping or scheduling.
RH+ achieves 8-12x speedup, over 74% energy reduction, and up to 52x EDP improvement without hardware modifications.
The performance gap between prior methods and RH+ widens as model size increases because GEMV remains the bottleneck.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same row-hit principle could be applied to other memory-bound workloads whose access patterns resemble GEMV.
Future DRAM address mappings for AI accelerators might be designed around row locality rather than bank parallelism from the start.
Combining RH+ with software-level tiling or model compression could produce additional gains beyond the reported numbers.
Real hardware validation would need to check whether the new stride pattern introduces unexpected refresh or power-delivery side effects not captured in simulation.

Load-bearing premise

The cycle-accurate simulator faithfully reproduces real HBM3 DRAM timing behavior, including any effects of the altered interleaving on power delivery or thermal limits.

What would settle it

Running the same four LLM workloads on physical HBM3-based PIM hardware with the RH+ stride change versus the baseline address mapping and measuring whether the observed speedup falls outside the simulated 8-12x range.

Figures

Figures reproduced from arXiv: 2606.05511 by Byeong Kil Lee, Jeeho Ryoo, Shafayat Mowla Anik, Yongchan Jung.

**Figure 1.** Figure 1: HBM3-PIM bank architecture and nRC vs. nCCDAB across five HBM3 speed grades [4]. address interleaving stride that forces every MAC AB into a different DRAM row. We propose RH+ scheduling, which changes the MAC AB address stride from 64 columns to 1 column, enabling 32 consecutive row hits per activate-precharge cycle. This paper makes three contributions: • We identify nRC, not nCCDAB, as the true GEMV bot… view at source ↗

**Figure 2.** Figure 2: Baseline stride-64 row addressing. Each MAC [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: RH+ mechanism (top) and timing-level speedup (bottom) at the [PITH_FULL_IMAGE:figures/full_fig_p002_3.png] view at source ↗

**Figure 5.** Figure 5: End-to-end energy reduction (%) of RH+. 175B [1] (96 layers, 96 heads, dmodel = 12288), LLaMA65B [12] (80 layers, 64 heads, dmodel = 8192), Megatron-Turing 530B [11] (105 layers, 128 heads, dmodel = 20480), and OPT66B [13] (64 layers, 72 heads, dmodel = 9216). Each model is evaluated with three input/output sequence length pairs (128/2048, 2048/128, 2048/2048) and two batch sizes (1 and 4), yielding 24 c… view at source ↗

**Figure 6.** Figure 6: End-to-end EDP improvement of RH+. 0 20 40 60 80 100 Energy Breakdown (lower better) AttAcc RH+ AttAcc RH+ AttAcc RH+ AttAcc RH+ GPT-175B LLAMA-65B MT-530B OPT-66B Background ACT/PRE/REF Compute Rearrangement [PITH_FULL_IMAGE:figures/full_fig_p004_6.png] view at source ↗

**Figure 7.** Figure 7: Energy breakdown normalized to baseline. [PITH_FULL_IMAGE:figures/full_fig_p004_7.png] view at source ↗

read the original abstract

Large language model inference on processing-in-memory (PIM) architectures promises to break the memory wall by performing multiply-accumulate (MAC) operations directly within HBM3 DRAM banks. Prior work identifies the power constraint timing parameter nCCDAB as the primary performance bottleneck and optimizes scheduling accordingly. We demonstrate that for GEMV operations that dominate autoregressive decoding, the DRAM row cycle time (nRC) is 10 to 11 times larger than nCCDAB. Consequently, nCCDAB is entirely masked, rendering prior nCCDAB-focused optimizations ineffective for these workloads. The root cause is inherited host-centric address interleaving, which forces every all-bank MAC command into a different DRAM row. We propose RH+ scheduling, a simple stride change that keeps 32 consecutive MAC operations within the same row. Cycle-accurate simulation across four LLM workloads shows that RH+ delivers 8-12x speedup, over 74% energy reduction, and up to 52x EDP improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main point is that nRC swamps nCCDAB for GEMV in PIM LLM decoding due to address mapping, so a stride tweak for row hits gives large simulated gains where prior work does not apply.

read the letter

The central claim is that for the GEMV operations that dominate autoregressive LLM decoding on PIM, the row cycle time nRC is 10-11 times nCCDAB, so the nCCDAB-focused scheduling from earlier papers gets masked entirely. The cause is the inherited host-centric interleaving that sends every all-bank MAC command to a new row. RH+ fixes this with a stride adjustment that keeps 32 consecutive MACs inside one row.

This observation looks new relative to the cited nCCDAB work. The paper does a clean job of showing why the prior optimizations fail to transfer to this workload and then demonstrates the fix through cycle-accurate simulation on four LLM models. The reported outcomes are 8-12x speedup, over 74% energy reduction, and up to 52x EDP improvement. The timing-parameter comparison itself is straightforward and uses standard DRAM specs.

The soft spot is that all the numbers come from simulation of modified HBM3 behavior. There is no silicon validation, no sensitivity sweeps on simulator assumptions, and no discussion of whether the new interleaving creates secondary problems in power delivery or thermal limits that the model does not capture. If the simulator underestimates effective row conflicts under the changed mapping, both the masking conclusion and the speedup numbers lose their grounding.

This work is aimed at researchers building or scheduling PIM accelerators for inference. Anyone already looking at HBM-based designs for GEMV-heavy phases will find the workload-specific address-mapping insight useful. It is worth sending to peer review because the scheduling idea is simple to implement and directly testable, even though the quantitative claims will need hardware confirmation before they can be taken as settled.

Referee Report

1 major / 0 minor

Summary. The paper claims that GEMV operations dominating autoregressive LLM decoding on PIM architectures with HBM3 exhibit nRC 10-11x larger than nCCDAB due to host-centric address interleaving that forces each all-bank MAC into a different row; prior nCCDAB-focused optimizations are thus ineffective. It proposes RH+ as a simple stride change to keep 32 consecutive MACs in the same row and reports 8-12x speedup, >74% energy reduction, and up to 52x EDP improvement from cycle-accurate simulation on four LLM workloads.

Significance. If the simulation results hold under real HBM3 behavior with the modified interleaving, the work would be significant for PIM-based LLM inference by shifting focus from nCCDAB to row-hit optimization and demonstrating large, practical gains from a minimal scheduling change. The quantitative speedups, energy, and EDP numbers would directly inform PIM scheduler design for memory-bound workloads.

major comments (1)

[Abstract] Abstract (and implied simulation methodology): the central claims that nRC is 10-11x nCCDAB (masking prior optimizations) and that RH+ yields 8-12x speedup rest entirely on cycle-accurate simulation fidelity for the new address mapping. No real-silicon validation, error bars, or sensitivity analysis to unmodeled effects (power delivery, thermal constraints, or effective nRC under the stride change) is provided; this is load-bearing because any deviation in modeled row-hit behavior would invalidate both the masking conclusion and the reported gains.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for highlighting the importance of simulation fidelity. We address the major comment below, noting that our evaluation follows standard practice for architecture proposals using cycle-accurate DRAM models.

read point-by-point responses

Referee: [Abstract] Abstract (and implied simulation methodology): the central claims that nRC is 10-11x nCCDAB (masking prior optimizations) and that RH+ yields 8-12x speedup rest entirely on cycle-accurate simulation fidelity for the new address mapping. No real-silicon validation, error bars, or sensitivity analysis to unmodeled effects (power delivery, thermal constraints, or effective nRC under the stride change) is provided; this is load-bearing because any deviation in modeled row-hit behavior would invalidate both the masking conclusion and the reported gains.

Authors: The nRC/nCCDAB ratio (10-11x) is taken directly from the JEDEC HBM3 specification and is independent of our simulator; it is a fixed timing parameter that masks nCCDAB-focused optimizations for GEMV regardless of address mapping. The RH+ speedup numbers derive from cycle-accurate simulation that models the DRAM command state machine, row activation, and the stride change in address interleaving. We model row hits by tracking per-bank row buffers under the new mapping, using publicly documented HBM3 timing values. Real-silicon validation of a modified controller interleaving is outside the scope of this simulation study, as is sensitivity analysis to secondary effects such as thermal throttling or power delivery noise. The core row-hit behavior under RH+ follows directly from the command sequence and does not alter the underlying DRAM timing parameters. revision: no

standing simulated objections not resolved

Real-silicon validation of RH+ under modified HBM3 interleaving, including sensitivity to unmodeled effects such as thermal constraints and power delivery

Circularity Check

0 steps flagged

No circularity; claims rest on external cycle-accurate simulation of standard HBM3 parameters

full rationale

The paper derives its core claims (nRC being 10-11x nCCDAB for GEMV, masking of prior optimizations, and 8-12x speedup from RH+ stride change) from cycle-accurate simulation using published HBM3 timing parameters and a proposed address mapping. No equations or results reduce to inputs by construction, no fitted parameters are relabeled as predictions, and no load-bearing self-citations or uniqueness theorems from the same authors appear. The simulation methodology is presented as independent of the target performance numbers, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard DRAM timing parameters (nRC, nCCDAB) taken from JEDEC specifications and the assumption that host-centric interleaving is the default mapping; no new free parameters or invented entities are introduced. The simulation itself is treated as an external benchmark.

axioms (1)

domain assumption Standard HBM3 timing parameters (nRC = 10-11x nCCDAB) apply unchanged under the new address mapping.
Invoked when claiming nCCDAB is masked and when reporting speedups from the stride change.

pith-pipeline@v0.9.1-grok · 5711 in / 1354 out tokens · 23509 ms · 2026-06-28T02:17:29.495923+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 3 linked inside Pith

[1]

Brown et al

T. Brown et al. Language models are few-shot learners.Advances Neural Inf. Process. Syst. (NeurIPS), 33, 2020

2020
[2]

Chandrasekar et al

K. Chandrasekar et al. DRAMPower: Open-source DRAM power & energy estimation tool. InURL: http://www.drampower .info, 2012

2012
[3]

Hwang et al

G. Hwang et al. NeuPIMs: NPU-PIM heterogeneous acceleration for batched LLM inference. InProc. ACM Int. Conf. Archit. Support Program. Lang. Oper . Syst. (ASPLOS), 2024

2024
[4]

JESD238A: High bandwidth memory (HBM3) DRAM, 2023

JEDEC. JESD238A: High bandwidth memory (HBM3) DRAM, 2023

2023
[5]

Jeong et al

Y . Jeong et al. AttAcc: Unleashing the power of PIM for batched transformer-based generative model inference. InProc. ACM Int. Conf. Archit. Support Program. Lang. Oper . Syst. (ASPLOS), 2024

2024
[6]

Kim et al

H. Kim et al. Newton: A DRAM-maker’s accelerator-in-memory (AiM) architecture for machine learning. InProc. IEEE/ACM Int. Symp. Microarchit. (MICRO), 2020

2020
[7]

Lee et al

S. Lee et al. AIM: Energy-efficient aggregation inside the memory hierarchy.ACM Trans. Archit. Code Optim. (TACO), 13(1), 2016

2016
[8]

Lee et al

S. Lee et al. Hardware architecture and software stack for PIM based on commercial DRAM technology: Industrial product. InProc. ACM/IEEE Int. Symp. Comput. Archit. (ISCA), 2021

2021
[9]

Lee et al

S. Lee et al. A 1ynm 1.25v 8gb 16gb/s/pin GDDR6-based accelerator- in-memory supporting 1tflops MAC operation and various activation functions. InProc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2022

2022
[10]

Luo et al

H. Luo et al. Ramulator 2.0: A modern, modular, and extensible DRAM simulator.IEEE Comput. Archit. Lett., 22(2), 2023

2023
[11]

Shoeybi et al

M. Shoeybi et al. Megatron-LM: Training multi-billion param- eter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

Pith/arXiv arXiv 1909
[12]

Touvron et al

H. Touvron et al. LLaMA: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023
[13]

Zhang et al

S. Zhang et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

Pith/arXiv arXiv 2022
[14]

Zhou et al

C. Zhou et al. TransPIM: A memory-based acceleration via software- hardware co-design for transformers. InProc. IEEE Int. Symp. High- Perform. Comput. Archit. (HPCA), 2022

2022

[1] [1]

Brown et al

T. Brown et al. Language models are few-shot learners.Advances Neural Inf. Process. Syst. (NeurIPS), 33, 2020

2020

[2] [2]

Chandrasekar et al

K. Chandrasekar et al. DRAMPower: Open-source DRAM power & energy estimation tool. InURL: http://www.drampower .info, 2012

2012

[3] [3]

Hwang et al

G. Hwang et al. NeuPIMs: NPU-PIM heterogeneous acceleration for batched LLM inference. InProc. ACM Int. Conf. Archit. Support Program. Lang. Oper . Syst. (ASPLOS), 2024

2024

[4] [4]

JESD238A: High bandwidth memory (HBM3) DRAM, 2023

JEDEC. JESD238A: High bandwidth memory (HBM3) DRAM, 2023

2023

[5] [5]

Jeong et al

Y . Jeong et al. AttAcc: Unleashing the power of PIM for batched transformer-based generative model inference. InProc. ACM Int. Conf. Archit. Support Program. Lang. Oper . Syst. (ASPLOS), 2024

2024

[6] [6]

Kim et al

H. Kim et al. Newton: A DRAM-maker’s accelerator-in-memory (AiM) architecture for machine learning. InProc. IEEE/ACM Int. Symp. Microarchit. (MICRO), 2020

2020

[7] [7]

Lee et al

S. Lee et al. AIM: Energy-efficient aggregation inside the memory hierarchy.ACM Trans. Archit. Code Optim. (TACO), 13(1), 2016

2016

[8] [8]

Lee et al

S. Lee et al. Hardware architecture and software stack for PIM based on commercial DRAM technology: Industrial product. InProc. ACM/IEEE Int. Symp. Comput. Archit. (ISCA), 2021

2021

[9] [9]

Lee et al

S. Lee et al. A 1ynm 1.25v 8gb 16gb/s/pin GDDR6-based accelerator- in-memory supporting 1tflops MAC operation and various activation functions. InProc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2022

2022

[10] [10]

Luo et al

H. Luo et al. Ramulator 2.0: A modern, modular, and extensible DRAM simulator.IEEE Comput. Archit. Lett., 22(2), 2023

2023

[11] [11]

Shoeybi et al

M. Shoeybi et al. Megatron-LM: Training multi-billion param- eter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

Pith/arXiv arXiv 1909

[12] [12]

Touvron et al

H. Touvron et al. LLaMA: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023

[13] [13]

Zhang et al

S. Zhang et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

Pith/arXiv arXiv 2022

[14] [14]

Zhou et al

C. Zhou et al. TransPIM: A memory-based acceleration via software- hardware co-design for transformers. InProc. IEEE Int. Symp. High- Perform. Comput. Archit. (HPCA), 2022

2022