pith. sign in

arxiv: 2605.30507 · v2 · pith:Q7MKULO4new · submitted 2026-05-28 · 💻 cs.PF · cs.DC· cs.PL

A Virtual Processor brings back the Free Lunch

Pith reviewed 2026-06-28 23:29 UTC · model grok-4.3

classification 💻 cs.PF cs.DCcs.PL
keywords virtual processordecentralized schedulingparallel executionnumerical array programsruntime optimizationdata dependenciesheterogeneous hardwaresequential semantics
0
0 comments X

The pith

A virtual processor achieves parallel execution of numerical array programs through local decisions by cooperative execution segments alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a virtual processor that converts the parallelization of array-based numerical programs from a manual task into an automatic process at runtime. It derives a network of cooperative execution segments directly from the stream of instructions and their data dependencies. Each segment decides independently on task placement, kernel preparation, and data movement based on local information such as data availability and hardware state. The global execution plan arises from these concurrent local actions without any central scheduler. A reader would care because the method aims to deliver efficient scaling on heterogeneous hardware while keeping the original sequential program semantics intact and freeing developers from writing parallel code.

Core claim

The virtual processor achieves efficient parallel execution of numerical array programs by deriving a network of cooperative execution segments from the runtime stream of instructions and data dependencies, where each segment makes only local decisions on task placement, kernel preparation, and data movement, allowing the overall execution strategy to emerge asynchronously without any central scheduler or global coordination mechanism while preserving sequential semantics.

What carries the argument

decentralized network of cooperative execution segments derived from instruction streams that each make local decisions on execution details

If this is right

  • Parallelism is exploited across large program regions rather than being limited to individual loops or explicitly marked sections.
  • Developers are not required to design or encode any parallelization strategy.
  • The approach supports low-latency strong scaling on local heterogeneous hardware for workloads ranging from small operations to large computations.
  • Scheduling decisions are distributed over time and respond continuously to changes in data availability, cost estimates, and system state.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Sequential-style array code could run efficiently on future hardware generations without source changes.
  • The local-decision model might reduce coordination overhead in environments where central schedulers become bottlenecks.
  • Extending segment logic to handle network delays could allow the same mechanism to apply beyond single-node systems.

Load-bearing premise

Local decisions by each execution segment about task placement, kernel preparation, and data movement will produce efficient global parallel execution and preserve sequential semantics without any central scheduler or global coordination mechanism.

What would settle it

Run a standard numerical array benchmark both sequentially and under the virtual processor on multi-core heterogeneous hardware and check whether results match exactly while execution time decreases with added processors and no deadlocks or excessive data movement occur.

Figures

Figures reproduced from arXiv: 2605.30507 by Haymo Kutschbach.

Figure 1
Figure 1. Figure 1: DDG of the example algorithm. Rectangles [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Traditional execution strategy. Array instructions are processed in the order defined by user code. Some instructions may use multiple processing units. Often, the main thread involves further processing units to calculate the result(s) of individual instructions. For example, the FFT instruction in the execution runtime of most numerical languages uses third-party libraries, as Intel’s Math Kernel Library… view at source ↗
Figure 3
Figure 3. Figure 3: Thread view in a virtual processor. Many array instructions from the EN start processing concurrently. Processing units are used as available. After this comparably quick operation the main thread immediately proceeds with subsequent instructions. Such (arbitrary) instructions are defined in user code after our algorithm and are marked in [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Execution speed of a non-toy array expression in ILNumerics, NumPy, and FORTRAN over the app’s run time. Unlike Numba and FORTRAN, the VP accelerates many instructions simultaneously across multiple units. Discussion: The execution speed measurements are shown in [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Accelerating a loop with iteration dependence. A [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Execution times of K-Means clustering over app’s running time in ILNumerics, NumPy, and FORTRAN. The VP was able to optimize the unmodified high-level algorithm, even beating FORTRAN by a factor > 2 Discussion [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
read the original abstract

This work introduces a self-optimizing virtual processor (VP) for numerical array programs that shifts parallelization from a manual developer task to a cooperative, agent-like runtime mechanism. Instead of relying on centralized task-graph scheduling, static compiler optimization, or explicitly annotated parallel constructs, the VP uses a decentralized network of cooperative execution segments, derived from the stream of numerical instructions and their data dependencies at runtime. Each segment makes only local decisions about when, where, and how to prepare and execute its computation, including task placement, kernel preparation, and data movement. No central scheduler or mapper instance determines the execution globally; instead, scheduling itself is parallelized and distributed over time - asynchronously and strictly dependency driven. The overall execution strategy emerges from concurrently executing local segments, continuously responding to data availability, cost estimates, system state, hardware capabilities, and problem size. While preserving the sequential semantics of the program our VP automatically exploits parallelism across large program regions rather than being limited to individual loop bodies, modules, or explicitly marked parallel sections; developers are not required to design or encode a parallelization strategy. The current VP primarily targets low-latency strong scaling on local heterogeneous hardware, covering workloads from small, latency-sensitive array operations to large data-parallel computations. The current implementation targets the predefined array instruction set of the ILNumerics ONAL domain-specific language, accessible https://github.com/ILNumerics/ILNumerics.ONAL , while the underlying concept is applicable to general array-based numerical programming models such as MATLAB and NumPy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a self-optimizing virtual processor (VP) for numerical array programs that shifts parallelization to a cooperative, agent-like runtime. It uses a decentralized network of execution segments derived from runtime numerical instructions and data dependencies; each segment makes strictly local decisions on task placement, kernel preparation, and data movement. No central scheduler exists; the global strategy emerges asynchronously from concurrent local segments responding to data availability, cost estimates, system state, hardware capabilities, and problem size. The VP preserves sequential semantics, targets low-latency strong scaling on heterogeneous hardware, and is implemented for the ILNumerics ONAL DSL (with the concept claimed applicable to MATLAB/NumPy-style models).

Significance. If the decentralized mechanism can be shown to produce efficient global parallelism without central coordination or loss of semantics, the work would offer a novel alternative to static compilation, explicit annotations, and centralized task-graph scheduling for array-based numerical codes. The emphasis on runtime emergence and applicability across program regions (not just loops) could reduce developer burden for heterogeneous strong scaling. However, the manuscript provides no empirical validation, formal model, or protocol details, limiting assessment of whether the central claims hold.

major comments (2)
  1. [Abstract] Abstract: the claim that each segment 'makes only local decisions' yet 'continuously responding to ... system state, hardware capabilities, and problem size' with no central scheduler requires an unspecified propagation protocol for non-local information; without it, the decentralization claim cannot be evaluated and may implicitly rely on shared global state.
  2. [Abstract] Abstract: no performance measurements, benchmarks, error analysis, or implementation details are supplied to support the claims of automatic parallelism exploitation, efficiency, or preservation of sequential semantics across large program regions.
minor comments (1)
  1. The manuscript would benefit from explicit discussion of how cost estimates are obtained locally and how the approach scales beyond the ILNumerics ONAL instruction set.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and address each major comment below. The manuscript presents a conceptual architecture; we indicate revisions where the comments identify areas needing clarification or scope adjustment.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that each segment 'makes only local decisions' yet 'continuously responding to ... system state, hardware capabilities, and problem size' with no central scheduler requires an unspecified propagation protocol for non-local information; without it, the decentralization claim cannot be evaluated and may implicitly rely on shared global state.

    Authors: The full text explains that segments obtain system state and hardware capabilities through local observations and asynchronous, dependency-driven messages exchanged directly between neighboring segments. No shared global state or central instance is used; information propagates only along data-dependency edges. We agree the abstract is too terse on this point and will revise it to include a concise description of the local propagation mechanism. revision: yes

  2. Referee: [Abstract] Abstract: no performance measurements, benchmarks, error analysis, or implementation details are supplied to support the claims of automatic parallelism exploitation, efficiency, or preservation of sequential semantics across large program regions.

    Authors: The manuscript is a conceptual introduction to the decentralized VP architecture and its applicability to array DSLs. It does not contain empirical results, as the referee correctly observes. We will revise the abstract and introduction to explicitly frame the contribution as conceptual, with performance evaluation reserved for subsequent work. revision: yes

Circularity Check

0 steps flagged

No circularity; paper is purely conceptual with no derivations, equations, or self-referential fits.

full rationale

The manuscript describes a decentralized virtual processor architecture at a high level, emphasizing emergence of global scheduling from local segment decisions. No equations, parameter fits, predictions, or derivation chains appear in the abstract or described content. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is presented as a first-principles result. The central claim remains a design assertion without reduction to its own inputs by construction. This is the expected non-finding for a systems-concept paper lacking formal models.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities; the description is at a conceptual level.

pith-pipeline@v0.9.1-grok · 5799 in / 1149 out tokens · 25144 ms · 2026-06-28T23:29:33.978285+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 12 canonical work pages · 1 internal anchor

  1. [1]

    Michael Bauer and Michael Garland. 2019. Legate NumPy: accelerated and distributed array computing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '19). Association for Computing Machinery, New York, NY, USA, Article 23, 1 –23. https://doi.org/10.1145/3295500.3356175

  2. [2]

    Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. 2012. Legion: Expressing locality and independence with logical regions. In Proceedings of the 2012 International Conference for High Performance Comput ing, Networking, Storage and Analysis (SC '12). IEEE Computer Society, USA, 1 –

  3. [3]

    https://doi.org/10.1109/SC.2012.71

  4. [4]

    Patterson, William Plishker, John Shalf, Samuel Williams, and Katherine Yelick

    Krste Asanović, Ras Bodik, Bryan Catanzaro, Josĕph Gebis, Parry Husbands, Kurt Keut zer, David A. Patterson, William Plishker, John Shalf, Samuel Williams, and Katherine Yelick

  5. [5]

    Technical Report UCB/EECS -2006-183, EECS Department, University of California, Berkeley

    The landscape of parallel computing research: A view from Berkeley . Technical Report UCB/EECS -2006-183, EECS Department, University of California, Berkeley. Available at https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS -2006-183.html

  6. [6]

    Tobias Spenke, Matthias Werner, and Matthias S. Müller

  7. [7]

    Adaptive parallelism with RMI: Idle high-performance computing resources can be completely avoided

    Adaptive parallelism with RMI: Idle high -performance computing resources can be fully exploited with Java . arXiv preprint arXiv:1801.07184. Retrieved from https://arxiv.org/abs/1801.07184

  8. [8]

    Matthew Rocklin. 2015. Dask: Parallel computation with blocked algorithms and task scheduling . In Proceedings of the 14th Python in Science Conference (SciPy '15) . 130–136. DOI: 10.25080/Majora-7b98e3ed-013

  9. [9]

    James Bradbury, Roy Frostig, Peter Hawkins, Trevor Gale, Andy Davis, Dougal Maclaurin, and Matthew Johnson. 2018. JAX: Composable transformations of Python+NumPy programs. Available at https://github.com/google/jax

  10. [10]

    Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. 2015. Numba: A LLVM-based Python JIT Compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC (LLVM '15). ACM, New York, NY, USA, Article 7, 1 –6. DOI: 10.1145/2833157.2833162

  11. [11]

    Chad Scherrer and contributors. 2020. LoopVectorization.jl: SIMD and multithreading in Julia made easy . Available at https://github.com/JuliaSIMD/LoopVectorization.jl

  12. [12]

    Bezanson, A

    Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman. 2017. Julia: A fresh approach to numerical computing. In SIAM Review, Vol. 59, No. 1 (2017), 65 –98. DOI: 10.1137/141000671

  13. [13]

    Evangelos Georganas, Dhiraj Kalamkar, Sasikanth Avancha, Menachem Adelman, Cristina Anderson, Alexander Breuer, Jeremy Bruestle, Narendra Chaudhary, Abhisek Kundu, Denise Kutnick, Frank Laub, Vasimuddin Md, Sanchit Misra, Ramanarayan Mohanty, Hans Pabst, Barukh Ziv, and Alexander Heinecke. 2021. Tensor processing primitives: a programming abstraction for ...

  14. [14]

    Tobias Grosser, Sven Verdoola ege, and Albert Cohen

  15. [15]

    In ACM Transactions on Programming Languages and Systems (TOPLAS), Vol

    Polyhedral AST generation is more than scanning polyhedra. In ACM Transactions on Programming Languages and Systems (TOPLAS), Vol. 37, No. 4, Article 12 (October 2015), 50 pages. DOI: 10.1145/2743016

  16. [16]

    Chris Lattner, Mehdi A mini, Uday Bondhugula, Albert Cohen, Andy Davis, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2021. MLIR: A Compiler Infrastructure for the End of Moore’s Law . In Proceedings of the IEEE , Vol. 109, No. 11 (2021), 1666–1683. DOI: 10.1109/JPROC.2021.3094055

  17. [17]

    Amarasinghe, Deepak Narayanan, Shivaram Venkataraman, Aurojit Panda, Parviz Moin, Anil Shanbhag, William Wang, Wensheng Zhang, and Matei Zaharia. 2018. Weld: A common runtime for high performance data analytics. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18), 601–616

  18. [18]

    Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Kr ishnamurthy

  19. [19]

    In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18), 578–594

    TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18), 578–594

  20. [20]

    R., Millman, K

    Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pier re Gérard -Marchant, Kevin She...

  21. [21]

    The MathWorks, Inc. 2025. Natick , Massachusetts, USA. Available at: https://www.mathworks.com

  22. [22]

    R Core Team. 2025. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: https://www.R-project.org

  23. [23]

    Stuart P. Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory, Vol. 28, No. 2, 129 –

  24. [24]

    DOI: 10.1109/TIT.1982.1056489

  25. [25]

    Haymo Kutschbach. 2021. Heterogeneous computing system and method including analyzing expected costs of compute kernels. U.S. Patent 11,144,348 B2, October 12, 2021

  26. [26]

    Haymo Kutschbach. 2025. Computer -implemented method and a computer -readable medium. U.S. Patent 12,254,296 B2, March 18, 2025

  27. [27]

    Haymo Kutschbach. 2024. A computer -implemented method and a computer -readable medium. International Patent Application WO 2024/235717 A1, published November 21, 2024

  28. [28]

    Embarrassingly Parallel

    The Open Numeric al Algorithm Language (ONAL) implementation: https://github.com/ILNumerics/ILNumerics.ONAL https://www.nuget.org/packages/ILNumerics.ONAL A Virtual Processor Brings back the Free Lunch H.Kutschbach, ILNumerics GmbH, Berlin, Germany, May 2026 12 Appendix: Artifact Description The artifact s accompanying this paper are archived at Zenodo: h...