pith. machine review for the scientific record.

arxiv: 2605.10860 · v1 · submitted 2026-05-11 · 💻 cs.DC

Recognition: no theorem link

Closing the Gap: Towards Portable Performance on RISC-V Vector Processors

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:01 UTC · model grok-4.3

classification 💻 cs.DC
keywords RISC-V Vector Extension · autovectorization · GCC · LLVM · HPC proxy applications · performance counters · microbenchmarks · quantum simulator

The pith

GCC 15 produces faster vector code than LLVM 21 in four of six HPC and ML proxies on real RISC-V hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors build assembly microbenchmarks to measure performance ceilings and validate hardware counters on RVV 1.0 processors. These benchmarks expose that predication and stride loads create overheads current compiler models do not fully capture. When the same compilers auto-vectorize six proxy applications, GCC 15 wins in four cases while LLVM 21 wins only in the two matrix kernels through greater instruction reduction, confirmed by the calibrated counters. Default vector length multiplier choices already sit near optimal performance. The evaluation of a full quantum simulator further shows that both compilers still handle complex memory patterns poorly.

Core claim

Through a suite of assembly microbenchmarks the work establishes performance ceilings on real RVV hardware and calibrates perf counters to isolate predication and stride-load costs. In six HPC and ML proxy applications GCC 15 outperforms LLVM 21 in four cases; LLVM 21 only leads in SGEMM and DGEMM because it reduces instruction count more aggressively, as the counters confirm. Default LMUL selection performs close to the best manual choice. In Google's Qsim both auto-vectorization and manual intrinsics expose compiler immaturity with complicated memory access patterns.

What carries the argument

assembly microbenchmarks that set performance ceilings and calibrate performance counters on RVV hardware
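To make the measurement idea concrete — this is an editorial sketch in Python, not the paper's hand-written RVV assembly — a stride-sweep microbenchmark compares per-element cost at unit stride against a wider stride, giving a ceiling-style ratio. The `ceiling_ratio` helper and its parameters are hypothetical stand-ins:

```python
import time

def traverse(data, stride):
    """Sum every `stride`-th element; a toy model of unit-stride vs strided loads."""
    total = 0.0
    for i in range(0, len(data), stride):
        total += data[i]
    return total

def ceiling_ratio(n=1 << 16, stride=8):
    """Per-element cost of strided traversal relative to unit stride.

    A ratio above 1 would indicate a stride overhead; on real RVV hardware the
    paper does this with assembly loops and validated perf counters instead.
    """
    data = [1.0] * n
    t0 = time.perf_counter(); traverse(data, 1); t1 = time.perf_counter()
    t2 = time.perf_counter(); traverse(data, stride); t3 = time.perf_counter()
    per_unit = (t1 - t0) / n
    per_strided = (t3 - t2) / (n // stride)
    return per_strided / per_unit
```

The point of the sketch is the methodology shape — fix everything but one access pattern, then attribute the cost delta to that pattern — not the absolute numbers, which in Python are dominated by interpreter overhead.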

If this is right

  • Compiler cost models need explicit terms for predication and stride-load overheads to improve code generation on RVV.
  • Default LMUL settings can be trusted for near-optimal results without per-application tuning.
  • Applications with irregular memory access will continue to require manual intrinsics until compiler support matures.
  • Validated hardware counters provide a reliable way to compare instruction reduction across compiler versions.
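The first bullet can be made concrete with a toy cost model — a hypothetical Python sketch, not any compiler's actual model — in which predication and stride-load overheads appear as explicit, separately charged terms:

```python
def vector_cost(n_elems, vlen, base_cycles_per_vec,
                predication_penalty=0.0, stride_penalty=0.0):
    """Hypothetical cycle estimate for one vectorized loop.

    `predication_penalty` charges extra cycles when the trip count is not a
    multiple of `vlen` (tail handled by masking); `stride_penalty` charges a
    per-vector cost for non-unit-stride loads. These are exactly the kinds of
    explicit terms the paper argues current cost models omit.
    """
    full_vecs, tail = divmod(n_elems, vlen)
    cycles = full_vecs * (base_cycles_per_vec + stride_penalty)
    if tail:
        # One masked tail iteration: base cost plus the predication surcharge.
        cycles += base_cycles_per_vec + stride_penalty + predication_penalty
    return cycles
```

With both penalties at zero this degenerates to the naive model; nonzero penalties let the model prefer scalar or differently shaped code when tails and strides dominate.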

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar microbenchmark suites could be used to diagnose compiler quality on other new vector extensions.
  • The observed gaps suggest that portable performance on RVV will improve most quickly by targeting cost models for memory patterns rather than by changing default settings.
  • Extending the same validation approach to additional scientific codes could identify further common bottlenecks.

Load-bearing premise

The assembly microbenchmarks accurately represent the performance bottlenecks in the proxy applications and the calibrated counters correctly isolate predication and stride-load overheads without confounding hardware or measurement effects.
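Operationally, "calibrated" means the counters were checked against instruction counts known by construction from the hand-written loops. A minimal sketch with hypothetical numbers (the helper and tolerance are editorial inventions, not the paper's procedure):

```python
def calibrate_counter(measured_counts, known_counts, tolerance=0.02):
    """Compare perf-counter readings against instruction counts known by
    construction from hand-written assembly loops (hypothetical data).

    Returns the worst relative error and whether the counter passes.
    """
    errors = [abs(m - k) / k for m, k in zip(measured_counts, known_counts)]
    worst = max(errors)
    return worst, worst <= tolerance
```

A counter that passes this check can then be trusted to attribute application-level deltas, which is what the load-bearing premise assumes.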

What would settle it

If detailed tracing of the proxy applications shows that the fractions of cycles spent on predication and stride loads deviate markedly from the proportions measured in the microbenchmarks, or if the relative performance ordering of GCC 15 and LLVM 21 reverses under the same counter validation.
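The first half of that falsification test reduces to a simple comparison of cycle fractions — hypothetical helper, illustrative tolerance:

```python
def deviates_markedly(app_fracs, micro_fracs, tol=0.10):
    """True if any per-application cycle fraction (e.g. for predication or
    stride loads) deviates from the microbenchmark proportion by more than
    `tol`; a True result would undermine the paper's attribution.
    """
    return any(abs(app_fracs[k] - micro_fracs[k]) > tol for k in micro_fracs)
```

The second half — a reversal of the GCC 15 / LLVM 21 ordering under the same counter validation — is a direct re-run, not a computation.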

Figures

Figures reproduced from arXiv: 2605.10860 by Ivy Peng, Maya Gokhale, Pei-Hung Lin, Ruimin Shi, Xavier Teruel.

Figure 1. Thus, we design the assembly benchmark for each variant to identify [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗

Figure 3. Compare the performance of tailing elements via setvl and mask operations on BPI-F3 and Jupiter [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

Figure 4. The peak throughput of selected vector and scalar arithmetic instructions [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

Figure 5. The performance by GCC 15 and Clang 21 autovectorization across six [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

Figure 6. The breakdown of load/store instructions in BPI-F3 [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

Figure 7. The impact of LMUL selection on Jupiter, normalized by GCC 15 nonvec [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

Figure 9. The comparison of Qsim across 3 versions using 8 cores view at source ↗
read the original abstract

The RISC-V Vector Extension~(RVV) is a cornerstone for supporting compute throughout in scientific and machine learning workloads. Yet compiler support and performance monitoring on real RVV~1.0 hardware are still evolving. In this work, we design a suite of assembly microbenchmarks to establish performance ceilings and calibrate performance counters on RVV hardware. Leveraging the assembly benchmarks, we find that predication overhead and stride load pose performance challenges that current compiler cost models do not yet fully address. Moreover, we present the first evaluation of GCC~15 and LLVM~21 autovectorization in HPC and ML proxy applications. GCC~15 outperforms LLVM~21 in four out of six applications. LLVM~21 only outperforms GCC~15 in SGEMM and DGEMM, driven by more aggressive instruction reduction confirmed through validated \texttt{perf} counters on the RVV hardware. We further show that the default LMUL selection in compilers performs close to the optimal. To study the RVV support for product-level application, we also evaluate the state-vector quantum simulator, Google's Qsim, with both manual RVV intrinsics and compiler auto-vectorization, revealing immaturity in current RVV compiler for complicated memory access pattern.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces a suite of assembly microbenchmarks to calibrate performance counters on real RISC-V Vector (RVV) hardware and identify bottlenecks such as predication overhead and stride loads that current compiler cost models do not fully address. It then provides the first evaluation of GCC 15 versus LLVM 21 autovectorization on six HPC and ML proxy applications, reporting that GCC outperforms LLVM in four cases while LLVM is superior only in SGEMM and DGEMM due to greater instruction reduction (confirmed via the calibrated perf counters). The work further shows that default LMUL selection is close to optimal and evaluates both manual intrinsics and auto-vectorization on Google's Qsim, concluding that current RVV compilers remain immature for complicated memory access patterns.

Significance. If the empirical claims hold, this study supplies timely, hardware-grounded data on RVV compiler maturity for portable performance in scientific and ML workloads. The microbenchmark suite and counter calibration methodology are reusable assets that can help the community quantify gaps, while the proxy-app and Qsim results highlight concrete compiler limitations (e.g., handling irregular memory) that should guide future optimization efforts. The focus on real RVV 1.0 hardware rather than simulation adds practical value.

major comments (1)
  1. §3 (Microbenchmarks and perf-counter calibration): the claim that the assembly microbenchmarks isolate predication and stride-load overheads without confounding factors is load-bearing for the central attribution of GCC's outperformance to instruction reduction. The paper does not demonstrate that these microbenchmarks reproduce the dynamic VL changes, cache-line effects, or irregular stride patterns present in Qsim and the other proxies; without such cross-validation or profile comparison, the counters cannot cleanly support the performance-delta explanation.
minor comments (2)
  1. Abstract, final sentence: grammatical issues ('immaturity in current RVV compiler for complicated memory access pattern') should be corrected to 'immaturity in current RVV compilers for complicated memory access patterns'.
  2. Evaluation sections: several figures lack error bars or mention of run-to-run variability, which would help readers assess the stability of the reported speedups and instruction-count deltas.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the practical value of the microbenchmark suite and the hardware-grounded evaluation. We address the major comment below with clarifications and a commitment to targeted revisions.

read point-by-point responses
  1. Referee: [—] §3 (Microbenchmarks and perf-counter calibration): the claim that the assembly microbenchmarks isolate predication and stride-load overheads without confounding factors is load-bearing for the central attribution of GCC's outperformance to instruction reduction. The paper does not demonstrate that these microbenchmarks reproduce the dynamic VL changes, cache-line effects, or irregular stride patterns present in Qsim and the other proxies; without such cross-validation or profile comparison, the counters cannot cleanly support the performance-delta explanation.

    Authors: We thank the referee for this observation. The §3 microbenchmarks use hand-written RVV assembly to isolate the incremental latency of predication (via controlled mask application and varying active vector lengths) and stride loads (via explicit stride values in vld instructions) while holding other factors constant. This controlled construction supports counter calibration for those specific events. We acknowledge that the manuscript does not include direct cross-validation—such as extracting and comparing dynamic VL distributions, cache-line access patterns, or stride histograms from the proxy applications and Qsim against the microbenchmark cases. The microbenchmarks target representative instances of these overheads rather than exact replication of application traces. On attribution, the paper links LLVM’s advantage in SGEMM/DGEMM to greater instruction reduction measured directly in those applications via the calibrated counters; GCC’s advantages in the remaining four proxies are tied to the microbenchmark finding that compiler cost models under-account for predication and stride costs. In the revision we will add (1) a subsection in §3 on the representativeness of the chosen patterns, informed by static inspection of the proxy loop structures, (2) supplementary figures correlating application-level perf counter readings with the isolated overheads, and (3) explicit wording clarifying the scope of the explanations. These changes will make the linkage between microbenchmarks and application results more transparent. revision: partial
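The promised supplementary check (item 2) amounts to correlating application-level counter readings with the microbenchmark-isolated overheads. A self-contained Pearson-correlation sketch on hypothetical data — the function and inputs are editorial, not from the manuscript:

```python
def pearson(xs, ys):
    """Pearson correlation between application-level counter readings and
    microbenchmark-isolated overhead estimates (hypothetical inputs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

A correlation near 1 across the six proxies would support the claim that the microbenchmark patterns are representative; a weak correlation would sharpen the referee's objection.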

Circularity Check

0 steps flagged

No circularity: pure empirical measurement study

full rationale

The paper performs direct hardware measurements using custom assembly microbenchmarks to calibrate perf counters for predication and stride-load overheads, then evaluates GCC 15 and LLVM 21 on six proxy applications plus Qsim via actual runs on RVV hardware. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or uniqueness theorems appear in the derivation chain. All performance deltas (e.g., instruction reduction in GEMM) are attributed to counter values observed in the experiments themselves, so the study is self-contained by construction and does not lean on external benchmarks or prior fitted inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is empirical performance analysis and introduces no new theoretical entities or fitted parameters; it relies on standard domain assumptions about hardware measurement validity.

axioms (1)
  • domain assumption Calibrated performance counters on RVV hardware accurately reflect execution-time overheads such as predication and stride loads.
    Invoked when the paper uses counter data to diagnose compiler cost-model gaps and to compare GCC vs LLVM instruction reduction.

pith-pipeline@v0.9.0 · 5525 in / 1331 out tokens · 76170 ms · 2026-05-12T04:01:20.333652+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

  1. [1] Adit, N., Sampson, A.: Performance left on the table: An evaluation of compiler autovectorization for RISC-V. IEEE Micro 42(5), 41–48 (2022)

  2. [2] Asanovic, K.: Vector Extension 1.0. https://github.com/riscv/riscv-v-spec/releases/tag/v1.0 (2021)

  3. [3] Banchelli, F., et al.: RISC-V in HPC: a look into tools for performance monitoring. In: International Conference on High Performance Computing. pp. 562–575 (2025)

  4. [4] Bernstein, O.: RISC-V Vector benchmark. https://github.com/camel-cdr/rvv-bench

  5. [5] Brown, N., et al.: Is RISC-V ready for HPC prime-time: Evaluating the 64-core Sophon SG2042 RISC-V CPU. In: Proc. SC'23 Workshops. pp. 1566–1574 (2023)

  6. [6] Carpentieri, et al.: A performance analysis of autovectorization on RVV RISC-V boards. In: 2025 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP). pp. 129–136 (2025)

  7. [7] Garcia, A.M., et al.: Inference performance of large language models on a 64-core RISC-V CPU with silicon-enabled vectors. Future Generation Computer Systems p. 108242 (2025)

  8. [8] Gupta, S.R., et al.: Accelerating CNN inference on long vector architectures via co-design. In: 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). pp. 145–155. IEEE (2023)

  9. [9] Han, L., et al.: IntrinTrans: LLM-based intrinsic code translator for RISC-V vector. arXiv preprint arXiv:2510.10119 (2025)

  10. [10] Lee, J.K., et al.: Test-driving RISC-V vector hardware for HPC. In: International Conference on High Performance Computing. pp. 419–432. Springer (2023)

  11. [11] Li, R.S., et al.: Evaluating RISC-V vector instruction set architecture extension with computer vision workloads. Journal of Computer Science and Technology 38(4), 807–820 (2023)

  12. [12] Lin, J.K., et al.: Rewriting and optimizing vector length agnostic intrinsics from Arm SVE to RVV. In: Workshop Proceedings of the 53rd International Conference on Parallel Processing. pp. 38–47 (2024)

  13. [13] Peccia, F.N., Haxel, F., Bringmann, O.: Tensor program optimization for the RISC-V vector extension using probabilistic programs. In: 2025 IEEE/ACM International Conference On Computer Aided Design (ICCAD). pp. 1–9. IEEE (2025)

  14. [14] Perotti, M., et al.: A "new Ara" for vector computing: An open source highly efficient RISC-V V 1.0 vector processor design. In: ASAP. IEEE (2022)

  15. [15] Quantum AI team: qsim (Jun 2025), https://doi.org/10.5281/zenodo.4067237

  16. [16] Ramírez, C., et al.: A RISC-V simulator and benchmark suite for designing and evaluating vector architectures. ACM Transactions on Architecture and Code Optimization (TACO) 17(4), 1–30 (2020)

  17. [17] Vizcaino, P., et al.: Designing a QEMU plugin to profile multicore long vector RISC-V architectures: RAVE. Future Generation Computer Systems p. 108100 (2025)