pith. machine review for the scientific record.

arxiv: 2604.21749 · v2 · submitted 2026-04-23 · 💻 cs.GR

Recognition: unknown

CuRast: Cuda-Based Software Rasterization for Billions of Triangles


Pith reviewed 2026-05-08 13:05 UTC · model grok-4.3

classification 💻 cs.GR
keywords software rasterization · CUDA · compute shaders · large triangle meshes · photogrammetry · Vulkan · atomic operations · depth buffering

The pith

A CUDA compute shader pipeline rasterizes dense meshes with hundreds of millions of triangles up to 2-5x faster than Vulkan for unique geometry and up to 12x faster for instanced geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper is trying to establish that software rasterization in CUDA can outperform Vulkan hardware rasterization for very large, dense triangle models. It does so with a three-stage pipeline that rasterizes small triangles directly using atomicMin for depth and forwards larger triangles to later stages, all without building acceleration structures in advance. A sympathetic reader would care because photogrammetry and 3D reconstruction frequently produce exactly these dense, opaque meshes where traditional pipelines incur overhead from data structures.

Core claim

CuRast implements a 3-stage rasterization pipeline in CUDA that rasterizes small triangles directly in stage 1 with atomicMin to store the closest fragments and forwards larger triangles to stages 2 and 3. This allows rendering of models with hundreds of millions of triangles up to 2-5x faster than Vulkan for unique geometry and up to 12x for instanced geometry, without the need to construct acceleration structures beforehand.
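The depth resolution described above can be sketched on the CPU. This is a minimal illustration, not the paper's code: it assumes each fragment is packed into one 64-bit key with the depth bits in the upper half and a triangle id in the lower half, so that a single minimum operation keeps the closest fragment. The packing layout and the names `pack_fragment`, `atomic_min`, and `resolve` are assumptions; `atomic_min` is a sequential stand-in for CUDA's atomicMin.

```python
import struct

EMPTY = 0xFFFFFFFFFFFFFFFF  # cleared framebuffer entry (assumed convention)


def float_bits(depth: float) -> int:
    """Reinterpret a non-negative float32 as an unsigned 32-bit integer.

    For non-negative IEEE 754 floats, integer order matches numeric order.
    """
    return struct.unpack("<I", struct.pack("<f", depth))[0]


def pack_fragment(depth: float, triangle_id: int) -> int:
    """Pack depth (high 32 bits) and triangle id (low 32 bits) into one key."""
    return (float_bits(depth) << 32) | (triangle_id & 0xFFFFFFFF)


def atomic_min(buffer: list, index: int, value: int) -> None:
    """Sequential stand-in for a 64-bit atomicMin on one buffer entry."""
    if value < buffer[index]:
        buffer[index] = value


def resolve(fragments, num_pixels: int) -> list:
    """Write (pixel, depth, triangle_id) fragments; return the winning id per pixel."""
    framebuffer = [EMPTY] * num_pixels
    for pixel, depth, triangle_id in fragments:
        atomic_min(framebuffer, pixel, pack_fragment(depth, triangle_id))
    return [None if key == EMPTY else key & 0xFFFFFFFF for key in framebuffer]
```

Because the minimum is commutative and associative, the result does not depend on the order in which fragments arrive, which is why no sorting or locking is needed in this sketch.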

What carries the argument

The 3-stage compute shader pipeline that classifies triangles by size and uses atomicMin depth writes for small triangles in the first stage.
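A size-based classification of this kind might look like the sketch below. The thresholds `STAGE1_MAX_SAMPLES` and `STAGE2_MAX_SAMPLES` are hypothetical values chosen for illustration; this page does not state the limits the paper uses, and the bounding-box sample count is only one plausible size measure.

```python
import math

# Hypothetical thresholds (in bounding-box samples); not taken from the paper.
STAGE1_MAX_SAMPLES = 64
STAGE2_MAX_SAMPLES = 4096


def bbox_samples(tri) -> int:
    """Number of pixel samples in the triangle's screen-space bounding box."""
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    width = math.ceil(max(xs)) - math.floor(min(xs))
    height = math.ceil(max(ys)) - math.floor(min(ys))
    return width * height


def classify_stage(tri) -> int:
    """Return 1, 2, or 3 depending on how large the triangle is on screen."""
    samples = bbox_samples(tri)
    if samples <= STAGE1_MAX_SAMPLES:
        return 1  # rasterize immediately
    if samples <= STAGE2_MAX_SAMPLES:
        return 2  # forward to the second stage
    return 3      # forward to the third stage


def partition(triangles) -> dict:
    """Group triangles by the stage that would handle them."""
    stages = {1: [], 2: [], 3: []}
    for tri in triangles:
        stages[classify_stage(tri)].append(tri)
    return stages
```

In a dense photogrammetry mesh most triangles would land in the first bucket, which is consistent with the Figure 5 caption's remark that the vast majority of Zorah's triangles are rasterized in stage 1.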

If this is right

  • Dense opaque meshes can be rendered substantially faster with this compute approach than with Vulkan.
  • Instanced geometry receives the largest speedups.
  • No acceleration structure construction step is required even at hundreds of millions of triangles.
  • Scenes with thousands of low-poly meshes remain slower than Vulkan.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Porting the size-based staging logic to other compute APIs could extend the speedups to more hardware platforms.
  • Adding a separate transparency stage would preserve the opaque path speed while widening applicability.
  • Pairing the method with clustered LOD generation could extend it beyond the 18.9 billion triangles already rendered brute-force in Figure 1.

Load-bearing premise

The input consists of dense, opaque meshes from photogrammetry or reconstruction without transparency, blending, or thousands of separate low-poly objects.

What would settle it

A side-by-side frame-time measurement on the same GPU for a 200-million-triangle photogrammetry model rendered once with CuRast and once with Vulkan.

Figures

Figures reproduced from arXiv: 2604.21749 by Elias Kristmann, Lukas Lipp, Markus Schütz, Michael Wimmer.

Figure 1: Brute-force rendering the Zorah data set on an RTX 5090. 38.8 GB of geometry loaded from an SSD, compressed to 21.7 GB, transferred to GPU and ready to render in 6.6 seconds. 18.9 billion triangles total; 13.5 billion triangles visible after frustum culling. Rendered in 67.3 milliseconds into a 3840×2160 framebuffer with screen space ambient occlusion and eye-dome lighting enabled.
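The figures quoted in the Figure 1 caption imply some derived rates that are easy to check. The short calculation below uses only the caption's numbers; the function name `figure1_rates` is illustrative, and the derived values are back-of-the-envelope estimates rather than results reported by the paper.

```python
def figure1_rates() -> dict:
    """Derive throughput figures from the numbers in the Figure 1 caption."""
    visible_triangles = 13.5e9   # visible after frustum culling
    frame_time_s = 67.3e-3       # 67.3 ms per frame
    pixels = 3840 * 2160         # framebuffer resolution
    raw_gb = 38.8                # geometry size before compression
    compressed_gb = 21.7         # geometry size after compression

    return {
        # roughly 2.0e11 visible triangles processed per second
        "triangles_per_second": visible_triangles / frame_time_s,
        # roughly 1.6e3 visible triangles per framebuffer pixel
        "triangles_per_pixel": visible_triangles / pixels,
        # just under 15 frames per second
        "frames_per_second": 1.0 / frame_time_s,
        # roughly 1.8:1 compression
        "compression_ratio": raw_gb / compressed_gb,
    }
```

The triangles-per-pixel figure shows how far sub-pixel the geometry is in this scene, which is the regime where per-triangle setup and memory traffic dominate rather than fragment coverage.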
Figure 3: Rasterization Stages. Pixels colored by iteration counter. Stage 1: A single thread iterates over all sample positions inside the triangle’s bounding box. Stage 2: A warp (32 threads) iterates over all samples inside the bounding box. Stage 3: A workgroup (64 threads) iterates over all samples in a 64x64 tile to rasterize a portion of the triangle.
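The bounding-box iteration in the Figure 3 caption can be sketched with a standard edge-function coverage test. This is a CPU illustration under stated assumptions: samples sit at pixel centers, no top-left fill rule is applied, and the `lanes` parameter (1 for a single thread, 32 for a warp) only emulates how samples would be strided across threads. The 64x64 tiling of stage 3 is not modeled, and `rasterize_bbox` is a hypothetical name.

```python
import math


def edge(a, b, p) -> float:
    """Signed area term: positive when p lies to the left of the edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize_bbox(tri, lanes: int = 1):
    """Return (covered pixels, iterations per lane) for one triangle.

    lanes=1 models a single thread walking the bounding box;
    lanes=32 models a warp where lane k handles samples k, k+32, k+64, ...
    """
    a, b, c = tri
    min_x = math.floor(min(a[0], b[0], c[0]))
    min_y = math.floor(min(a[1], b[1], c[1]))
    width = math.ceil(max(a[0], b[0], c[0])) - min_x
    height = math.ceil(max(a[1], b[1], c[1])) - min_y
    total = width * height
    area = edge(a, b, c)
    if area == 0 or total <= 0:
        return [], 0  # degenerate triangle or empty bounding box

    sign = 1 if area > 0 else -1  # accept both winding orders
    covered = []
    for lane in range(lanes):
        for i in range(lane, total, lanes):
            x = min_x + i % width
            y = min_y + i // width
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0 = sign * edge(b, c, p)
            w1 = sign * edge(c, a, p)
            w2 = sign * edge(a, b, p)
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                covered.append((x, y))
    return sorted(covered), math.ceil(total / lanes)
```

Both configurations cover the same pixels; only the number of iterations each lane performs changes, which is the quantity the figure visualizes as the iteration counter.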
Figure 4: Finding an appropriate mip map level by intersecting the triangle’s plane at the current and three adjacent pixels, extrapolating uv coordinates, and computing the extent in texture space.
4.5. Instancing: Memory bandwidth is one of the major bottlenecks during rasterization, particularly when rendering massive amounts of small triangles that require fewer compute resources. Instancing allows us to reduce…
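The mip level selection from the Figure 4 caption can be illustrated as follows. The sketch assumes the plane intersection and uv extrapolation have already produced uv coordinates for the current pixel and its three neighbors, and it applies the conventional log2-of-footprint rule; whether the paper uses exactly this rule is an assumption, and `mip_level` and its parameters are hypothetical names.

```python
import math


def mip_level(uv_center, uv_neighbors, tex_width, tex_height, num_levels) -> int:
    """Pick a mip level from the uv footprint of one pixel.

    uv_center and uv_neighbors hold (u, v) pairs in [0, 1] texture space
    for the current pixel and its three adjacent pixels.
    """
    us = [uv_center[0]] + [uv[0] for uv in uv_neighbors]
    vs = [uv_center[1]] + [uv[1] for uv in uv_neighbors]

    # Extent of the footprint in texels along each texture axis.
    extent_u = (max(us) - min(us)) * tex_width
    extent_v = (max(vs) - min(vs)) * tex_height

    # A footprint below one texel means magnification: use the base level.
    footprint = max(extent_u, extent_v, 1.0)

    # Each mip level halves the resolution, hence the base-2 logarithm.
    level = int(math.log2(footprint))
    return min(level, num_levels - 1)
```

For example, a pixel whose neighbors are four texels away in a 1024×1024 texture maps to level 2, where one texel of that level covers roughly the same area as the pixel.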
Figure 5: Top: Colored by size (px) of the bounding box of a pixel’s triangle. Bottom: Pixels colored by the rasterization stage that created the fragment. In Zorah, the vast majority of triangles are rasterized in stage 1.
Original abstract

Previous work shows that small triangles can be rasterized efficiently with compute shaders. Building on this insight, we explore how far this can be pushed for massive triangle datasets without the need to construct acceleration structures in advance. Method: A 3-stage rasterization pipeline first rasterizes small triangles directly in stage 1, using atomicMin to store the closest fragments. Larger triangles are forwarded to stages 2 and 3. Results: CuRast can render models with hundreds of millions of triangles up to 2-5x (unique) or up to 12x (instanced) faster than Vulkan. Vulkan remains an order of magnitude faster for low-poly meshes. Limitations: We currently focus on dense, opaque meshes that you would typically obtain from photogrammetry/3D reconstruction. Blending/Transparency is not yet supported, and scenes with thousands of low-poly meshes are not implemented efficiently. Future Work: To make it suitable for games and a wider range of use cases, future work will need to (1) optimize handling of scenes with tens of thousands of nodes/meshes, (2) add support for hierarchical clustered LODs such as those produced by Meshoptimizer, (3) add support for transparency, likely in its own stage so as to keep opaque rasterization untouched and fast. Source Code: https://github.com/m-schuetz/CuRast

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents CuRast, a CUDA-based software rasterizer for massive triangle datasets (hundreds of millions to billions of triangles) that avoids pre-built acceleration structures. It uses a 3-stage pipeline: stage 1 directly rasterizes small triangles with atomicMin depth testing; larger triangles are forwarded to stages 2 and 3. Empirical results claim 2-5x speedups (unique meshes) or up to 12x (instanced) over Vulkan for dense models, while Vulkan is faster for low-poly scenes. The work targets dense opaque meshes from photogrammetry/3D reconstruction, with explicit limitations on transparency, blending, and scenes with thousands of low-poly meshes. Open source code is provided at https://github.com/m-schuetz/CuRast.

Significance. If the performance claims hold under rigorous verification, this could meaningfully advance real-time or near-real-time rendering of very large-scale 3D models in photogrammetry, cultural heritage, and scientific visualization, extending prior compute-shader rasterization techniques without relying on hardware rasterizers or BVH construction. The explicit scoping to dense opaque meshes and the release of reproducible source code are strengths that support direct validation of the reported speedups.

major comments (1)
  1. Abstract / Results: The performance claims (2-5x unique, up to 12x instanced speedups over Vulkan for hundreds of millions of triangles) are stated without accompanying details on benchmark hardware, exact scene parameters (triangle counts, density, instancing factors), measurement methodology, error bars, or data exclusion rules. This makes it difficult to assess the robustness and generalizability of the central empirical claims.
minor comments (2)
  1. Limitations paragraph: The phrasing 'dense, opaque meshes that you would typically obtain from photogrammetry/3D reconstruction' uses informal second-person language; rephrasing to 'meshes typically obtained from...' would improve academic tone.
  2. Future Work: The listed items (handling tens of thousands of nodes, hierarchical clustered LODs, transparency) are clearly scoped but could benefit from a brief note on how each would integrate with the existing 3-stage pipeline without degrading the opaque path.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and recommendation for minor revision. We address the single major comment point-by-point below.

  1. Referee: Abstract / Results: The performance claims (2-5x unique, up to 12x instanced speedups over Vulkan for hundreds of millions of triangles) are stated without accompanying details on benchmark hardware, exact scene parameters (triangle counts, density, instancing factors), measurement methodology, error bars, or data exclusion rules. This makes it difficult to assess the robustness and generalizability of the central empirical claims.

    Authors: We agree that the abstract and results summary would benefit from additional experimental details to allow readers to better evaluate the claims. In the revised manuscript we will add a concise 'Experimental Setup' paragraph (and corresponding references from the abstract) that specifies: the exact GPU/CPU hardware and driver versions used for all timings; the precise triangle counts, mesh densities, and instancing factors for each reported scene; the timing methodology (CUDA events, warm-up iterations, number of measurement runs); and any data-exclusion criteria applied. Although our measurements exhibited low run-to-run variance, we will also report standard deviations for the key speedup figures. These additions will be kept brief so as not to lengthen the abstract unduly while still providing the requested context. revision: yes
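The reporting the authors describe (warm-up iterations, repeated runs, standard deviations for speedups) could be summarized along these lines. This sketch assumes per-frame times have already been collected, for instance with CUDA events as the rebuttal mentions; the helper names are hypothetical, and any numbers passed in are the caller's own measurements, not data from the paper.

```python
import statistics


def summarize(frame_times_ms, warmup: int = 0):
    """Return (mean, sample standard deviation) after discarding warm-up frames.

    Needs at least two frames after warm-up, since a sample standard
    deviation is undefined for a single value.
    """
    samples = list(frame_times_ms)[warmup:]
    return statistics.mean(samples), statistics.stdev(samples)


def speedup(baseline_ms, candidate_ms, warmup: int = 0) -> float:
    """Ratio of mean baseline frame time to mean candidate frame time."""
    base_mean, _ = summarize(baseline_ms, warmup)
    cand_mean, _ = summarize(candidate_ms, warmup)
    return base_mean / cand_mean
```

Discarding warm-up frames before averaging matters because the first frames often include one-time costs such as allocation or cache population that do not reflect steady-state rendering.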

Circularity Check

0 steps flagged

No significant circularity identified


The paper describes an empirical implementation of a 3-stage CUDA software rasterization pipeline for dense opaque meshes, with performance results obtained via direct benchmarking against Vulkan on specific hardware. No mathematical derivations, fitted parameters, predictions, or uniqueness theorems appear in the argument; the method is presented as an engineering exploration building on prior compute-shader insights without reducing any claim to its own inputs by construction. Source code is provided for independent verification, and limitations are explicitly scoped, confirming the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Engineering implementation paper relying on standard GPU hardware features with no new mathematical axioms, free parameters, or invented entities.

axioms (1)
  • [standard math] GPU atomicMin operations correctly resolve per-pixel depth for fragments in compute shaders
    Core mechanism in stage 1 for storing closest fragments.

pith-pipeline@v0.9.0 · 5561 in / 1119 out tokens · 73672 ms · 2026-05-08T13:05:56.626115+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Geometrically Approximated Modeling for Emitter-Centric Ray-Triangle Filtering in Arbitrarily Dynamic LiDAR Simulation

    cs.GR 2026-05 unverdicted novelty 7.0

    GRCA uses emitter-centric geometric culling of rays per triangle to accelerate LiDAR simulation in arbitrarily dynamic scenes, reporting up to 14.55x speedup over Embree and 7.97x over OptiX.
