arxiv: 2604.21749 · v2 · submitted 2026-04-23 · 💻 cs.GR

CuRast: Cuda-Based Software Rasterization for Billions of Triangles

Markus Schütz, Lukas Lipp, Elias Kristmann, Michael Wimmer

Pith reviewed 2026-05-08 13:05 UTC · model grok-4.3

classification 💻 cs.GR

keywords: software rasterization, CUDA, compute shaders, large triangle meshes, photogrammetry, Vulkan, atomic operations, depth buffering

The pith

A CUDA compute shader pipeline rasterizes dense meshes with hundreds of millions of triangles 2-5x faster than Vulkan for unique geometry and up to 12x faster for instanced geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper is trying to establish that software rasterization in CUDA can outperform Vulkan hardware rasterization for very large, dense triangle models. It does so with a three-stage pipeline that rasterizes small triangles directly using atomicMin for depth and forwards larger triangles to later stages, all without building acceleration structures in advance. A sympathetic reader would care because photogrammetry and 3D reconstruction frequently produce exactly these dense, opaque meshes where traditional pipelines incur overhead from data structures.

Core claim

CuRast implements a 3-stage rasterization pipeline in CUDA that rasterizes small triangles directly in stage 1 with atomicMin to store the closest fragments and forwards larger triangles to stages 2 and 3. This allows rendering of models with hundreds of millions of triangles up to 2-5x faster than Vulkan for unique geometry and up to 12x for instanced geometry, without the need to construct acceleration structures beforehand.

What carries the argument

The 3-stage compute shader pipeline that classifies triangles by size and uses atomicMin depth writes for small triangles in the first stage.
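The size-based split can be pictured with a minimal sketch. It assumes triangles are classified by the larger side of their screen-space bounding box; the two thresholds and the names `classify` and `split_by_stage` are hypothetical, since the page does not state the cut-offs the paper uses.

```python
# Hypothetical thresholds in pixels; the paper's actual cut-offs are not given here.
STAGE1_MAX_EXTENT = 16
STAGE2_MAX_EXTENT = 128


def bounding_box(tri):
    """Return (min_x, min_y, max_x, max_y) of a triangle given as three (x, y) points."""
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    return min(xs), min(ys), max(xs), max(ys)


def classify(tri):
    """Pick a stage (1, 2 or 3) from the larger side of the screen-space bounding box."""
    min_x, min_y, max_x, max_y = bounding_box(tri)
    extent = max(max_x - min_x, max_y - min_y)
    if extent <= STAGE1_MAX_EXTENT:
        return 1  # small: rasterized directly
    if extent <= STAGE2_MAX_EXTENT:
        return 2  # medium: forwarded to stage 2
    return 3      # large: forwarded to stage 3


def split_by_stage(tris):
    """Group triangle indices into one queue per stage."""
    queues = {1: [], 2: [], 3: []}
    for index, tri in enumerate(tris):
        queues[classify(tri)].append(index)
    return queues
```

On dense photogrammetry meshes nearly every triangle would land in the first queue, which is why the cheap direct path dominates the frame time.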

If this is right

Dense opaque meshes can be rendered substantially faster with this compute approach than with Vulkan.
Instanced geometry receives the largest speedups.
No acceleration structure construction step is required even at hundreds of millions of triangles.
Scenes with thousands of low-poly meshes remain slower than Vulkan.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Porting the size-based staging logic to other compute APIs could extend the speedups to more hardware platforms.
Adding a separate transparency stage would preserve the opaque path speed while widening applicability.
Pairing the method with clustered LOD generation could push the practical limit toward billions of triangles.

Load-bearing premise

The input consists of dense, opaque meshes from photogrammetry or reconstruction without transparency, blending, or thousands of separate low-poly objects.

What would settle it

A side-by-side frame-time measurement on the same GPU for a 200-million-triangle photogrammetry model rendered once with CuRast and once with Vulkan.

Figures

Figures reproduced from arXiv: 2604.21749 by Elias Kristmann, Lukas Lipp, Markus Schütz, Michael Wimmer.

**Figure 1.** Brute-force rendering the Zorah data set on an RTX 5090. 38.8 GB of geometry loaded from an SSD, compressed to 21.7 GB, transferred to GPU and ready to render in 6.6 seconds. 18.9 billion triangles total; 13.5 billion triangles visible after frustum culling. Rendered in 67.3 milliseconds into a 3840×2160 framebuffer with screen space ambient occlusion and eye-dome lighting enabled.
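To put the Figure 1 numbers in proportion, the short calculation below derives a few rates from the figures quoted in the caption. The variable names are illustrative. The derived values are rough: the 6.6 seconds cover loading, compression and transfer together, and not every visible triangle necessarily produces a fragment.

```python
# Figures quoted in the Figure 1 caption.
triangles_visible = 13.5e9      # after frustum culling
frame_time_s = 67.3e-3          # 67.3 milliseconds
width, height = 3840, 2160
raw_gb, compressed_gb = 38.8, 21.7
load_time_s = 6.6

# Derived quantities (simple ratios, not measurements from the paper).
triangles_per_second = triangles_visible / frame_time_s
frames_per_second = 1.0 / frame_time_s
triangles_per_pixel = triangles_visible / (width * height)
compression_ratio = compressed_gb / raw_gb
end_to_end_load_gb_per_s = raw_gb / load_time_s

print(f"{triangles_per_second / 1e9:.1f} billion triangles per second")
print(f"{frames_per_second:.1f} frames per second")
print(f"{triangles_per_pixel:.0f} visible triangles per framebuffer pixel")
print(f"compressed size is {compression_ratio:.0%} of the raw size")
print(f"{end_to_end_load_gb_per_s:.1f} GB/s from SSD to ready-to-render")
```

The triangles-per-pixel ratio shows how strongly sub-pixel geometry dominates this scene.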

**Figure 3.** Rasterization Stages. Pixels colored by iteration counter. Stage 1: A single thread iterates over all sample positions inside the triangle’s bounding box. Stage 2: A warp (32 threads) iterates over all samples inside the bounding box. Stage 3: A workgroup (64 threads) iterates over all samples in a 64x64 tile to rasterize a portion of the triangle.
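The stage 1 loop described in the caption, where one thread walks every sample in the bounding box, can be sketched serially as below. The inside test uses standard edge functions, which is an assumption: the page does not say which coverage test CuRast applies, and no fill rule for shared edges is implemented here. All function names are hypothetical.

```python
import math


def edge(a, b, p):
    """Edge function: twice the signed area of the triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize_small_triangle(v0, v1, v2, width, height):
    """Return the (x, y) pixels whose centers lie inside or on the triangle."""
    area = edge(v0, v1, v2)
    if area == 0:
        return []  # degenerate triangle covers nothing

    # Bounding box clamped to the framebuffer.
    min_x = max(int(math.floor(min(v0[0], v1[0], v2[0]))), 0)
    max_x = min(int(math.ceil(max(v0[0], v1[0], v2[0]))), width - 1)
    min_y = max(int(math.floor(min(v0[1], v1[1], v2[1]))), 0)
    max_y = min(int(math.ceil(max(v0[1], v1[1], v2[1]))), height - 1)

    covered = []
    for y in range(min_y, max_y + 1):
        for x in range(min_x, max_x + 1):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0 = edge(v1, v2, p)
            w1 = edge(v2, v0, p)
            w2 = edge(v0, v1, p)
            # Accept both windings by comparing against the sign of the area.
            if area > 0:
                inside = w0 >= 0 and w1 >= 0 and w2 >= 0
            else:
                inside = w0 <= 0 and w1 <= 0 and w2 <= 0
            if inside:
                covered.append((x, y))
    return covered
```

For triangles only a few pixels wide the loop body runs a handful of times, so a single thread per triangle keeps all lanes busy. Stages 2 and 3 split the same iteration space across a warp or a workgroup.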

**Figure 4.** Finding an appropriate mip map level by intersecting the triangle’s plane at the current and three adjacent pixels, extrapolating uv coordinates, and computing the extent in texture space.
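The last step of the caption, turning a texture-space extent into a mip level, can be illustrated with the rough sketch below. It takes the uv coordinates at the current pixel and its neighbours as given inputs and does not reproduce the plane intersection that produces them. The function name, its parameters, and the choice of a base-2 logarithm of the largest texel distance are assumptions for illustration.

```python
import math


def mip_level(uv_center, uv_neighbors, tex_width, tex_height, num_levels):
    """Pick a mip level from the largest texel distance to any neighbouring pixel."""
    max_extent = 0.0
    for u, v in uv_neighbors:
        du = (u - uv_center[0]) * tex_width    # distance in texels along u
        dv = (v - uv_center[1]) * tex_height   # distance in texels along v
        max_extent = max(max_extent, math.hypot(du, dv))

    if max_extent <= 1.0:
        return 0  # one texel or less per pixel: use the full-resolution level

    # Each mip level halves the resolution, hence the base-2 logarithm.
    level = int(math.log2(max_extent))
    return min(level, num_levels - 1)
```

For example, a pixel footprint of about 5.7 texels in a 1024×1024 texture selects level 2 in this sketch, where each texel covers a 4×4 block of the original.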

**Figure 5.** Top: Colored by size (px) of the bounding box of a pixel’s triangle. Bottom: Pixels colored by the rasterization stage that created the fragment. In Zorah, the vast majority of triangles are rasterized in stage 1.

Original abstract

Previous work shows that small triangles can be rasterized efficiently with compute shaders. Building on this insight, we explore how far this can be pushed for massive triangle datasets without the need to construct acceleration structures in advance. Method: A 3-stage rasterization pipeline first rasterizes small triangles directly in stage 1, using atomicMin to store the closest fragments. Larger triangles are forwarded to stages 2 and 3. Results: CuRast can render models with hundreds of millions of triangles up to 2-5x (unique) or up to 12x (instanced) faster than Vulkan. Vulkan remains an order of magnitude faster for low-poly meshes. Limitations: We currently focus on dense, opaque meshes that you would typically obtain from photogrammetry/3D reconstruction. Blending/Transparency is not yet supported, and scenes with thousands of low-poly meshes are not implemented efficiently. Future Work: To make it suitable for games and a wider range of use cases, future work will need to (1) optimize handling of scenes with tens of thousands of nodes/meshes, (2) add support for hierarchical clustered LODs such as those produced by Meshoptimizer, (3) add support for transparency, likely in its own stage so as to keep opaque rasterization untouched and fast. Source Code: https://github.com/m-schuetz/CuRast

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CuRast gives a practical three-stage CUDA pipeline that rasterizes hundreds of millions of dense opaque triangles faster than Vulkan without building acceleration structures, with the code released for direct checks.

The paper's core contribution is a complete CUDA software rasterizer built around splitting triangles by size: small ones go straight to atomicMin depth testing in stage 1, while larger ones get forwarded to stages 2 and 3. This avoids the overhead of general acceleration structures for the photogrammetry-style meshes the authors target, and they report 2-5x gains on unique data and up to 12x on instanced data compared with Vulkan. The work extends earlier compute-shader rasterization results into a usable end-to-end system for billions of triangles, which is the main new piece here.

They also ship the full source on GitHub, which lets anyone reproduce the numbers and see exactly how the stages interact. That is real value for anyone who needs fast software rendering on CUDA hardware for reconstruction pipelines. The claims stay scoped to dense, opaque meshes, and the limitations section is explicit about the missing pieces like transparency and support for thousands of low-poly nodes. Benchmarks are presented at a high level in the abstract, but the code availability removes most of the usual verification burden. No load-bearing math or derivations appear, just straightforward compute-shader engineering and empirical timing. The citation pattern is standard and points back to the relevant prior rasterization papers without overclaiming.

This is useful for people who already work with massive reconstructed models and want a CUDA-specific path that beats the hardware rasterizer in that narrow regime. It is not aimed at game engines or general scenes. I would send it to peer review because the implementation is concrete, the scope is honest, and the code makes the performance numbers falsifiable.

Referee Report

1 major / 2 minor

Summary. The paper presents CuRast, a CUDA-based software rasterizer for massive triangle datasets (hundreds of millions to billions of triangles) that avoids pre-built acceleration structures. It uses a 3-stage pipeline: stage 1 directly rasterizes small triangles with atomicMin depth testing; larger triangles are forwarded to stages 2 and 3. Empirical results claim 2-5x speedups (unique meshes) or up to 12x (instanced) over Vulkan for dense models, while Vulkan is faster for low-poly scenes. The work targets dense opaque meshes from photogrammetry/3D reconstruction, with explicit limitations on transparency, blending, and scenes with thousands of low-poly meshes. Open source code is provided at https://github.com/m-schuetz/CuRast.

Significance. If the performance claims hold under rigorous verification, this could meaningfully advance real-time or near-real-time rendering of very large-scale 3D models in photogrammetry, cultural heritage, and scientific visualization, extending prior compute-shader rasterization techniques without relying on hardware rasterizers or BVH construction. The explicit scoping to dense opaque meshes and the release of reproducible source code are strengths that support direct validation of the reported speedups.

major comments (1)

Abstract / Results: The performance claims (2-5x unique, up to 12x instanced speedups over Vulkan for hundreds of millions of triangles) are stated without accompanying details on benchmark hardware, exact scene parameters (triangle counts, density, instancing factors), measurement methodology, error bars, or data exclusion rules. This makes it difficult to assess the robustness and generalizability of the central empirical claims.

minor comments (2)

Limitations paragraph: The phrasing 'dense, opaque meshes that you would typically obtain from photogrammetry/3D reconstruction' uses informal second-person language; rephrasing to 'meshes typically obtained from...' would improve academic tone.
Future Work: The listed items (handling tens of thousands of nodes, hierarchical clustered LODs, transparency) are clearly scoped but could benefit from a brief note on how each would integrate with the existing 3-stage pipeline without degrading the opaque path.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and recommendation for minor revision. We address the single major comment point-by-point below.

Referee: Abstract / Results: The performance claims (2-5x unique, up to 12x instanced speedups over Vulkan for hundreds of millions of triangles) are stated without accompanying details on benchmark hardware, exact scene parameters (triangle counts, density, instancing factors), measurement methodology, error bars, or data exclusion rules. This makes it difficult to assess the robustness and generalizability of the central empirical claims.

Authors: We agree that the abstract and results summary would benefit from additional experimental details to allow readers to better evaluate the claims. In the revised manuscript we will add a concise 'Experimental Setup' paragraph (and corresponding references from the abstract) that specifies: the exact GPU/CPU hardware and driver versions used for all timings; the precise triangle counts, mesh densities, and instancing factors for each reported scene; the timing methodology (CUDA events, warm-up iterations, number of measurement runs); and any data-exclusion criteria applied. Although our measurements exhibited low run-to-run variance, we will also report standard deviations for the key speedup figures. These additions will be kept brief so as not to lengthen the abstract unduly while still providing the requested context.

revision: yes
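The timing methodology named in the rebuttal (warm-up iterations, repeated runs, standard deviations) might look like the sketch below. It is a host-side stand-in that uses `time.perf_counter` instead of CUDA events. The `render_frame` callable and the default warm-up and run counts are hypothetical.

```python
import statistics
import time


def benchmark(render_frame, warmup=5, runs=20):
    """Return (mean_ms, stdev_ms) of render_frame after discarding warm-up runs."""
    for _ in range(warmup):
        render_frame()  # warm caches and clocks; these timings are not recorded

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        render_frame()
        samples_ms.append((time.perf_counter() - start) * 1e3)

    return statistics.mean(samples_ms), statistics.stdev(samples_ms)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms
```

Because GPU work is asynchronous, a real harness would have to synchronize before reading the clock or time on the device itself, which is what CUDA events are for. Reporting the standard deviation next to each mean lets a reader judge whether a 2x and a 5x result are separated by more than noise.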

Circularity Check

0 steps flagged

No significant circularity identified

The paper describes an empirical implementation of a 3-stage CUDA software rasterization pipeline for dense opaque meshes, with performance results obtained via direct benchmarking against Vulkan on specific hardware. No mathematical derivations, fitted parameters, predictions, or uniqueness theorems appear in the argument; the method is presented as an engineering exploration building on prior compute-shader insights without reducing any claim to its own inputs by construction. Source code is provided for independent verification, and limitations are explicitly scoped, confirming the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Engineering implementation paper relying on standard GPU hardware features with no new mathematical axioms, free parameters, or invented entities.

axioms (1)

standard math: GPU atomicMin operations correctly resolve per-pixel depth for fragments in compute shaders.
Core mechanism in stage 1 for storing closest fragments.
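The reason an atomic minimum can resolve depth regardless of thread order is shown in the serial sketch below. It assumes each fragment is packed into one 64-bit key with depth in the upper 32 bits and a triangle id in the lower 32 bits. That is a common layout for this kind of rasterizer, but the page does not confirm it is the one CuRast uses. The `atomic_min` function here is a plain Python stand-in for the GPU operation.

```python
import struct

EMPTY = 0xFFFFFFFFFFFFFFFF  # cleared value: compares greater than any packed fragment


def float_bits(depth):
    """Bit pattern of a float32; for non-negative values it sorts like the float itself."""
    return struct.unpack("<I", struct.pack("<f", depth))[0]


def pack_fragment(depth, triangle_id):
    """Depth in the high 32 bits so that comparing keys compares depth first."""
    return (float_bits(depth) << 32) | (triangle_id & 0xFFFFFFFF)


def atomic_min(buffer, index, value):
    """Serial stand-in for a GPU atomicMin: keep the smaller value, return the old one."""
    old = buffer[index]
    if value < old:
        buffer[index] = value
    return old


def resolve(fragments, num_pixels):
    """Apply every (pixel, depth, triangle_id) fragment to a cleared buffer."""
    buffer = [EMPTY] * num_pixels
    for pixel, depth, triangle_id in fragments:
        atomic_min(buffer, pixel, pack_fragment(depth, triangle_id))
    return buffer


def unpack_triangle_id(packed):
    """Recover the triangle id of the winning fragment."""
    return packed & 0xFFFFFFFF
```

Since the minimum is commutative and associative, any arrival order of fragments yields the same buffer, so no locks or sorting are needed. The same property rules out blending, which depends on order, consistent with the paper's stated limitation.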

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Geometrically Approximated Modeling for Emitter-Centric Ray-Triangle Filtering in Arbitrarily Dynamic LiDAR Simulation
cs.GR 2026-05 unverdicted novelty 7.0

GRCA uses emitter-centric geometric culling of rays per triangle to accelerate LiDAR simulation in arbitrarily dynamic scenes, reporting up to 14.55x speedup over Embree and 7.97x over OptiX.
