pith. machine review for the scientific record.

arxiv: 2604.10113 · v1 · submitted 2026-04-11 · 💻 cs.DC · cs.AR

Recognition: unknown

FlexVector: A SpMM Vector Processor with Flexible VRF for GCNs on Varying-Sparsity Graphs


Pith reviewed 2026-05-10 16:09 UTC · model grok-4.3

classification 💻 cs.DC cs.AR
keywords SpMM · GCN inference · vector processor · flexible VRF · sparse matrix multiplication · graph neural networks · hardware accelerator · irregular workloads

The pith

FlexVector uses row-wise dataflow and flexible vector registers to speed up sparse matrix multiplication for graph convolutional networks on irregular graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes FlexVector as a vector-processor architecture tailored for the two-stage sparse-dense matrix multiplications that define GCN inference. It establishes a row-wise product-based dataflow that regularizes execution by granting full-row access to vector registers, paired with software-managed flexible VRFs that adjust to varying sparsity without rigid banking or cache overhead. A supporting graph-aware preprocessing and node-partitioning step restructures workloads to fit register capacity and reduce memory traffic. If these elements work together, the design would deliver substantially higher performance and lower energy on real graphs whose node degrees follow power-law distributions. Experimental comparisons against cache-centric baselines with equivalent buffer sizes are presented to support the gains.

Core claim

FlexVector accelerates SpMM for GCN inference through a row-wise, product-based dataflow that enables full-row access to vector registers and eliminates the need for multi-banked designs. It employs software-managed flexible VRFs to adapt to irregular access patterns while preserving memory efficiency. Combined with graph-aware preprocessing and node partitioning, this co-design minimizes memory traffic for graphs with varying sparsity.
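
Concretely, a row-wise product dataflow computes each output row to completion before moving on. Below is a minimal NumPy sketch of that access pattern over a CSR matrix; the layout and names are illustrative rather than taken from the paper, and the `acc` buffer stands in for the full-row vector-register access the claim describes.

```python
import numpy as np

def spmm_row_wise(indptr, indices, data, B):
    """Row-wise (Gustavson-style) SpMM: C = A @ B, with A in CSR format.

    Each output row is accumulated in full before the next row starts,
    so partial sums never leave the accumulator -- the access pattern
    that full-row vector-register access would keep on-chip.
    """
    n_rows = len(indptr) - 1
    C = np.zeros((n_rows, B.shape[1]), dtype=B.dtype)
    for i in range(n_rows):
        acc = np.zeros(B.shape[1], dtype=B.dtype)  # stand-in for one full-row vector register
        for k in range(indptr[i], indptr[i + 1]):  # nonzeros of sparse row i
            acc += data[k] * B[indices[k]]         # scale and add one full dense row
        C[i] = acc                                 # finished row written exactly once
    return C
```

Because every nonzero touches exactly one dense row of B plus one resident accumulator, the irregularity collapses to "how many nonzeros does this row have," which is the variation the flexible VRFs are said to absorb.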

What carries the argument

Flexible vector register files (VRFs) under software management that adapt to irregular access patterns within a row-wise product-based dataflow.
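
The mechanism is only named, not specified, in this summary. As intuition, one can model a software-managed VRF as a flat register pool that compiled code carves into variable-length segments per output row; everything below is a hypothetical model, not FlexVector's actual allocator.

```python
class FlexibleVRF:
    """Toy model of a software-managed, flexible vector register file.

    A fixed pool of vector elements is carved into variable-length named
    segments under compiler control, instead of fixed-width hardware
    banks. Purely illustrative.
    """

    def __init__(self, capacity_elems: int):
        self.capacity = capacity_elems
        self.next_free = 0
        self.segments = {}  # name -> (offset, length)

    def alloc(self, name: str, length: int) -> int:
        """Reserve `length` elements; return the segment's offset."""
        if self.next_free + length > self.capacity:
            raise MemoryError("VRF exhausted: repartition the row group")
        self.segments[name] = (self.next_free, length)
        self.next_free += length
        return self.segments[name][0]

    def reset(self) -> None:
        """Between row groups the compiler re-carves the whole file."""
        self.next_free = 0
        self.segments.clear()


# A high-degree hub row might get one long accumulator while a batch of
# low-degree rows gets many short ones -- same hardware, different carving.
vrf = FlexibleVRF(capacity_elems=512)
vrf.alloc("acc_hub", 256)
vrf.alloc("acc_leaf0", 64)
vrf.alloc("acc_leaf1", 64)
```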

If this is right

  • Memory traffic for SpMM operations drops because the row-wise dataflow and VRFs keep more data on-chip (a toy traffic model follows this list).
  • Vector parallelism is exposed without requiring complex multi-banked register hardware.
  • The preprocessing step allows the same architecture to handle graphs with different sparsity levels efficiently.
  • Energy efficiency improves at area parity because unnecessary off-chip accesses are avoided.
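
To make the first bullet concrete, here is a back-of-envelope traffic model. Every number and the residency parameter are hypothetical; the point is only the shape of the comparison, in which higher on-chip residency cuts the dominant dense-row fetch term.

```python
def offchip_traffic_words(nnz, n_rows, feat_dim, dense_row_residency):
    """Crude off-chip traffic estimate for C = A @ B (A sparse, B dense).

    Assumptions, all illustrative: a dense feature row not resident
    on-chip costs a full feat_dim-word fetch per nonzero; each finished
    output row is written exactly once; partial sums never spill.
    """
    dense_reads = nnz * (1.0 - dense_row_residency) * feat_dim
    output_writes = n_rows * feat_dim
    return dense_reads + output_writes


# Hypothetical graph: 10k nodes, ~20 average degree, 128-dim features.
baseline = offchip_traffic_words(200_000, 10_000, 128, dense_row_residency=0.5)
vrf_kept = offchip_traffic_words(200_000, 10_000, 128, dense_row_residency=0.8)
print(f"toy traffic ratio: {baseline / vrf_kept:.2f}x")  # about 2.2x for these inputs
```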

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same row-wise plus flexible-register pattern could apply to other sparse linear-algebra kernels that exhibit power-law irregularity.
  • Software control of register allocation may prove more scalable than hardware caching when sparsity varies across inputs.
  • Hardware designers could explore similar flexible storage structures for emerging graph workloads beyond GCNs.

Load-bearing premise

The graph-aware preprocessing and node partitioning strategy can restructure irregular graph workloads to match the row-wise dataflow and VRF capacity without introducing significant overhead or accuracy loss.
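
What that premise requires in mechanism terms is roughly the following: neighbor lists longer than the register budget get cut into register-sized chunks, and everything else passes through untouched. The sketch below is a hypothetical stand-in for the paper's partitioning algorithm, whose actual thresholds and heuristics are not reproduced here.

```python
def partition_rows(indptr, vrf_row_budget):
    """Split each CSR row into chunks of at most vrf_row_budget nonzeros.

    High-degree (power-law hub) rows become several register-sized work
    items; low-degree rows pass through as a single item. A hypothetical
    stand-in for graph-aware node partitioning.
    """
    work_items = []
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        for s in range(start, end, vrf_row_budget):
            work_items.append((row, s, min(s + vrf_row_budget, end)))
    return work_items
```

Chunked rows then need a final merge of their partial accumulators; that merge, together with the preprocessing pass itself, is exactly the overhead the premise asserts stays small.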

What would settle it

Measuring execution time and energy on the five real-world GCN datasets when running identical workloads on FlexVector versus the cache-centric baseline with matching buffer sizes would confirm or refute the 3.78× speedup and 40.5% energy reduction.
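
Once those measurements exist, the settling step is arithmetic. A sketch, with the aggregation across datasets left as a stated assumption (this summary does not say whether the paper reports arithmetic or geometric means):

```python
import math

def headline_numbers(t_base, t_flex, e_base, e_flex):
    """Fold per-dataset measurements into the two headline figures.

    t_* are execution times and e_* energies, one entry per dataset.
    Geometric mean for speedup is an assumption made here, not a detail
    taken from the paper.
    """
    n = len(t_base)
    speedup = math.prod(tb / tf for tb, tf in zip(t_base, t_flex)) ** (1 / n)
    energy_drop = sum(1 - ef / eb for eb, ef in zip(e_base, e_flex)) / n
    return speedup, energy_drop  # compare against 3.78 and 0.405
```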

Figures

Figures reproduced from arXiv: 2604.10113 by Bohan Li, Enyi Yao, Francky Catthoor, Shengmin Li, Simei Yang, Xinyu Shi.

Figure 1: The aggregation and combination stages of GCN.
Figure 2: Power-law distribution of the Cora dataset.
Figure 3: Four typical dataflows for computing SpMM.
Figure 4: Overall architecture of FlexVector.
Figure 5: FlexVector's coarse-grained ISA for the SpMM com…
Figure 6: Comparison of VRF provisioning before and after…
Figure 7: Comparison of execution modes. (a) Graph topology…
Figure 9: Area breakdown of FlexVector (total: 39.43K…)
Figure 10: Ablation study of FlexVector. (a–b) Speedup, energy, and area averaged across five datasets, normalized to a GROW…
Figure 11: Effectiveness of Algorithm 2 on the CiteSeer dataset un…
Figure 12: Comparison of GROW-like (GL) and FlexVector (FV) across varying buffer sizes.
Figure 13: Impact of VRF length (VLEN = 64–2048 bit) and depth (…
Original abstract

Graph Convolutional Networks (GCNs) are widely adopted for tasks involving relational or graph-structured data and can be formulated as two-stage sparse-dense matrix multiplication (SpMM) during inference. However, existing accelerators often struggle with the irregular workloads induced by power-law node degree distributions. In this work, we propose FlexVector, a vector-processor-based architecture that efficiently accelerates SpMM for GCN inference. To address irregular computation patterns, FlexVector adopts a row-wise, product-based dataflow that regularizes SpMM execution and exposes vector parallelism through full-row access to vector registers, eliminating the need for multi-banked register file designs. Building on this dataflow, it introduces software-managed, flexible vector register files (VRFs) that adapt to irregular data access patterns, without sacrificing memory access efficiency. To further exploit these architectural capabilities, we develop a graph-aware preprocessing and node partitioning strategy that restructures irregular graph workloads to better match the row-wise dataflow and VRF capacity. This hardware-software co-design reduces memory traffic, leading to significant performance and energy efficiency gains on real-world GCN workloads. Experimental results on five real-world GCN datasets show that the VRF-centric FlexVector achieves a 3.78x speedup and 40.5% lower energy at comparable area cost relative to a state-of-the-art cache-centric baseline with buffers of the same size.
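
The "two-stage SpMM" formulation the abstract leans on is the standard GCN layer factorization. A dense reference sketch follows; the adjacency would be CSR-sparse in practice, and Â is the normalized adjacency in the usual GCN notation, not a symbol taken from this paper.

```python
import numpy as np

def gcn_layer(A_hat, H, W):
    """One GCN inference layer as two chained matrix products.

    Stage 1, aggregation:  A_hat @ H  -- sparse x dense (SpMM), irregular.
    Stage 2, combination:  (...) @ W  -- shown dense x dense here; feature
    matrices can themselves be sparse, which is one reading of the
    abstract's "two-stage SpMM".
    """
    aggregated = A_hat @ H                 # neighbor feature aggregation
    return np.maximum(aggregated @ W, 0)   # combination followed by ReLU
```

The first product inherits the graph's power-law irregularity; the second is close to a conventional GEMM. FlexVector's claims concern the irregular stage.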

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FlexVector, a vector-processor architecture for accelerating sparse-dense matrix multiplication (SpMM) in GCN inference. It features a row-wise product-based dataflow that enables full-row vector register access, software-managed flexible vector register files (VRFs) to adapt to irregular patterns without multi-banked designs, and a graph-aware preprocessing plus node partitioning strategy to restructure power-law graph workloads. The central empirical claim is that this hardware-software co-design delivers 3.78× speedup and 40.5% lower energy at comparable area versus a cache-centric baseline with equivalent buffers, validated on five real-world GCN datasets.

Significance. If the reported gains hold after full accounting of preprocessing, the work offers a concrete demonstration of VRF-centric design benefits for irregular SpMM, with explicit numerical results on multiple datasets providing empirical grounding for the co-design. This could inform future accelerators targeting varying-sparsity graphs by showing how dataflow regularization and partitioning reduce memory traffic.

major comments (2)
  1. [Abstract, §5 Experimental Results] The 3.78× speedup and 40.5% energy reduction are stated for the SpMM kernel after graph-aware preprocessing and node partitioning have restructured the workloads. The manuscript provides no explicit measurements, amortization analysis, or inclusion of preprocessing runtime/energy/memory-traffic costs in the five-dataset figures, nor does it clarify whether repartitioning overhead applies to new graphs. This directly affects the net end-to-end performance claim relative to the cache-centric baseline.
  2. [§5 Experimental Results] The comparison uses a state-of-the-art cache-centric baseline with buffers of the same size, yet the text does not detail whether the baseline receives identical preprocessing/partitioning, report error bars or run-to-run variance, or specify data-exclusion rules. These omissions limit assessment of the robustness of the reported 3.78× and 40.5% figures.
minor comments (2)
  1. [Abstract] The abstract and introduction should explicitly name the five datasets and their key sparsity characteristics (e.g., average degree, power-law exponent) to allow immediate context for the varying-sparsity claims.
  2. [§3] Notation for VRF capacity and partitioning thresholds is introduced in §3 but would benefit from a consolidated table of free parameters and their default values used in the experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address each major comment below and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract, §5 Experimental Results] The 3.78× speedup and 40.5% energy reduction are stated for the SpMM kernel after graph-aware preprocessing and node partitioning have restructured the workloads. The manuscript provides no explicit measurements, amortization analysis, or inclusion of preprocessing runtime/energy/memory-traffic costs in the five-dataset figures, nor does it clarify whether repartitioning overhead applies to new graphs. This directly affects the net end-to-end performance claim relative to the cache-centric baseline.

    Authors: We thank the referee for highlighting this important point. The reported performance and energy figures are indeed for the SpMM kernel after applying the graph-aware preprocessing and node partitioning. Preprocessing is a one-time cost per graph and is amortized over repeated inferences on the same graph, which is common in GCN deployment scenarios (a one-line amortization model is sketched after these responses). However, to provide a complete evaluation, we will include in the revised manuscript an analysis of the preprocessing overhead in terms of runtime, energy, and memory traffic, along with amortization discussions for both static and dynamic graphs. This will clarify the net end-to-end benefits. Revision: yes.

  2. Referee: [§5 Experimental Results] The comparison uses a state-of-the-art cache-centric baseline with buffers of the same size, yet the text does not detail whether the baseline receives identical preprocessing/partitioning, report error bars or run-to-run variance, or specify data-exclusion rules. These omissions limit assessment of the robustness of the reported 3.78× and 40.5% figures.

    Authors: We confirm that the cache-centric baseline was evaluated with the identical preprocessing and node partitioning strategy to ensure fairness, as these are software-level optimizations. We will explicitly state this in the revised §5. Our experiments are cycle-accurate simulations and thus deterministic, with no run-to-run variance; we will note this and omit error bars accordingly. All five datasets were included without any exclusion. We will add these clarifications to improve the robustness assessment. Revision: yes.
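
The amortization argument in response 1 reduces to a one-line model, sketched here with all costs hypothetical (none are quantified in this summary):

```python
def net_speedup(k, t_pre, t_flex, t_base):
    """Net speedup after amortizing a one-time preprocessing cost.

    k inference runs on the same graph; t_pre is the preprocessing time,
    t_flex and t_base the per-inference times on FlexVector and the
    baseline. All inputs hypothetical.
    """
    return (k * t_base) / (t_pre + k * t_flex)

# As k grows, this approaches t_base / t_flex (the reported kernel
# speedup); at k = 1 on a fresh graph, t_pre can dominate -- which is
# precisely the case the referee asks the revision to quantify.
```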

Circularity Check

0 steps flagged

No significant circularity; claims rest on external experimental comparisons

Full rationale

The paper presents its architecture (row-wise product-based dataflow, flexible VRF design, and graph-aware preprocessing/node partitioning) as a hardware-software co-design. Its load-bearing claims are experimental: a 3.78× speedup and 40.5% lower energy on five real-world GCN datasets versus an external state-of-the-art cache-centric baseline with equivalent buffers. No equations, fitted parameters, or self-citations are shown that reduce any prediction or uniqueness result to the paper's own inputs by construction. The preprocessing strategy is presented as an enabling technique whose overheads are not claimed to be derived from the results themselves. The evaluation is grounded in external benchmark comparisons and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about GCN workloads and design choices for VRF capacity and partitioning that are not derived from first principles but selected to fit the target graphs.

free parameters (2)
  • VRF capacity
    The size and flexibility parameters of the vector register file are chosen by the designers to balance area, access patterns, and performance for the target sparsity levels.
  • Node partitioning thresholds
    Parameters controlling how the graph is restructured in preprocessing are tuned to match VRF capacity and row-wise dataflow.
axioms (2)
  • domain assumption GCNs during inference can be formulated as two-stage SpMM operations.
    This is stated directly in the abstract as the basis for the targeted workload.
  • domain assumption Power-law node degree distributions create irregular access patterns that standard cache-centric designs handle poorly.
    Invoked to motivate the need for the new row-wise dataflow and flexible VRF; a toy illustration of this skew follows.
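
The second axiom is easy to illustrate numerically. A minimal sketch sampling a power-law-like (Zipf) degree distribution; the exponent is illustrative, not measured from the paper's datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
degrees = rng.zipf(a=2.1, size=100_000)  # Zipf as a discrete power-law proxy
degrees = degrees[degrees <= 10_000]     # trim the extreme tail for the toy

print("mean degree :", degrees.mean())
print("99th pctile :", np.percentile(degrees, 99))
print("max degree  :", degrees.max())
# A fixed row buffer sized for the mean starves the hubs; sized for the
# max, it sits mostly idle -- the dilemma flexible VRFs are meant to dodge.
```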

pith-pipeline@v0.9.0 · 5569 in / 1332 out tokens · 46687 ms · 2026-05-10T16:09:01.397874+00:00 · methodology


Reference graph

Works this paper leans on

23 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1] T. Kipf, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
  2. [2] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," arXiv preprint arXiv:1312.6203, 2013.
  3. [3] M. Gori, G. Monfardini, and F. Scarselli, "A new model for learning in graph domains," in Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, vol. 2. IEEE, 2005, pp. 729–734.
  4. [4] M. Yan, L. Deng, X. Hu, L. Liang, Y. Feng, X. Ye, Z. Zhang, D. Fan, and Y. Xie, "HyGCN: A GCN accelerator with hybrid architecture," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 15–29.
  5. [5] H. Zeng and V. Prasanna, "GraphACT: Accelerating GCN training on CPU-FPGA heterogeneous platforms," in Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020, pp. 255–265.
  6. [6] M. Yoo, J. Song, J. Lee, N. Kim, Y. Kim, and J. Lee, "SGCN: Exploiting compressed-sparse features in deep graph convolutional network accelerators," in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2023, pp. 1–14.
  7. [7] J. Li, A. Louri, A. Karanth, and R. Bunescu, "GCNAX: A flexible and energy-efficient accelerator for graph convolutional neural networks," in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021, pp. 775–788.
  8. [8] T. Geng, A. Li, R. Shi, C. Wu, T. Wang, Y. Li, P. Haghi, A. Tumeo, S. Che, S. Reinhardt et al., "AWB-GCN: A graph convolutional network accelerator with runtime workload rebalancing," in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 922–936.
  9. [9] R. Hwang, M. Kang, J. Lee, D. Kam, Y. Lee, and M. Rhu, "GROW: A row-stationary sparse-dense GEMM accelerator for memory-efficient graph convolutional neural networks," in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2023, pp. 42–55.
  10. [10] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, "PowerGraph: Distributed graph-parallel computation on natural graphs," in 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), 2012, pp. 17–30.
  11. [11] A. Abou-Rjeili and G. Karypis, "Multilevel algorithms for partitioning power-law graphs," in Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium. IEEE, 2006.
  12. [12] M. Latapy, "Main-memory triangle computations for very large (sparse (power-law)) graphs," Theoretical Computer Science, vol. 407, no. 1–3, pp. 458–473, 2008.
  13. [13] F. G. Gustavson, "Two fast algorithms for sparse matrices: Multiplication and permuted transposition," ACM Transactions on Mathematical Software (TOMS), vol. 4, no. 3, pp. 250–269, 1978.
  14. [14] Y. Gao, L. Gong, C. Wang, T. Wang, X. Li, and X. Zhou, "Algorithm/hardware co-optimization for sparsity-aware SpMM acceleration of GNNs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 42, no. 12, pp. 4763–4776, 2023.
  15. [15] N. Srivastava, H. Jin, J. Liu, D. Albonesi, and Z. Zhang, "MatRaptor: A sparse-sparse matrix multiplication accelerator based on row-wise product," in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 766–780.
  16. [16] A. G. Ahsaei, L. Yin, S. Tian, F. Ye, F. Yao, and H. Zheng, "Rethinking tiling and dataflow for SpMM acceleration: A graph transformation framework," in Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, 2025, pp. 1535–1548.
  17. [17] M. Cavalcante, F. Schuiki, F. Zaruba, M. Schaffner, and L. Benini, "Ara: A 1-GHz+ scalable and energy-efficient RISC-V vector processor with multiprecision floating-point support in 22-nm FD-SOI," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 2, pp. 530–543, 2019.
  18. [18] M. Platzer and P. Puschner, "Vicuna: A timing-predictable RISC-V vector coprocessor for scalable parallel computation," in 33rd Euromicro Conference on Real-Time Systems (ECRTS 2021). Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2021.
  19. [19] C. Gómez, F. Mantovani, E. Focht, and M. Casas, "Efficiently running SpMV on long vector architectures," in Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 292–303.
  20. [20] V. Titopoulos, K. Alexandridis, C. Peltekis, C. Nicopoulos, and G. Dimitrakopoulos, "IndexMAC: A custom RISC-V vector instruction to accelerate structured-sparse matrix multiplications," in 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2024, pp. 1–6.
  21. [21] G. Karypis and V. Kumar, "A fast and high quality multilevel scheme for partitioning irregular graphs," SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359–392, 1998.
  22. [22] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, "CACTI 7: New tools for interconnect exploration in innovative off-chip memories," ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 1–25, 2017.
  23. [23] M. O'Connor, "Highlights of the high-bandwidth memory (HBM) standard," in Memory Forum Workshop, 2014.