CuRast: Cuda-Based Software Rasterization for Billions of Triangles
Pith reviewed 2026-05-08 13:05 UTC · model grok-4.3
The pith
A CUDA compute shader pipeline rasterizes dense meshes with hundreds of millions of triangles up to 2-5x faster than Vulkan for unique geometry and up to 12x faster for instanced geometry.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CuRast implements a 3-stage rasterization pipeline in CUDA that rasterizes small triangles directly in stage 1 with atomicMin to store the closest fragments and forwards larger triangles to stages 2 and 3. This allows rendering of models with hundreds of millions of triangles up to 2-5x faster than Vulkan for unique geometry and up to 12x for instanced geometry, without the need to construct acceleration structures beforehand.
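The size-based routing in the core claim can be sketched on the CPU as follows. This is a minimal Python illustration and not the authors' CUDA code. The `Triangle` type, the `classify` and `partition` helpers, and the 4-pixel cutoff `SMALL_MAX_EXTENT_PX` are all assumptions made for illustration, since the abstract does not state how "small" is defined or what stages 2 and 3 do with forwarded triangles.

```python
import math
from dataclasses import dataclass

# Hypothetical cutoff: the source does not give the size that separates
# "small" triangles from "larger" ones.
SMALL_MAX_EXTENT_PX = 4


@dataclass
class Triangle:
    # Screen-space vertex positions in pixels: ((x0, y0), (x1, y1), (x2, y2))
    verts: tuple


def pixel_bounds(tri):
    """Return the inclusive pixel bounding box (xmin, ymin, xmax, ymax)."""
    xs = [v[0] for v in tri.verts]
    ys = [v[1] for v in tri.verts]
    return (math.floor(min(xs)), math.floor(min(ys)),
            math.floor(max(xs)), math.floor(max(ys)))


def classify(tri, max_extent=SMALL_MAX_EXTENT_PX):
    """Label a triangle 'stage1' if its bounding box is small, else 'forward'."""
    xmin, ymin, xmax, ymax = pixel_bounds(tri)
    width = xmax - xmin + 1
    height = ymax - ymin + 1
    return "stage1" if max(width, height) <= max_extent else "forward"


def partition(triangles, max_extent=SMALL_MAX_EXTENT_PX):
    """Split triangles into (rasterize in stage 1, forward to later stages)."""
    stage1, forwarded = [], []
    for tri in triangles:
        target = stage1 if classify(tri, max_extent) == "stage1" else forwarded
        target.append(tri)
    return stage1, forwarded
```

Using the bounding-box extent as the size measure is one plausible choice among several. Projected area or a per-axis limit would fit the same structure.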
What carries the argument
The 3-stage compute shader pipeline that classifies triangles by size and uses atomicMin depth writes for small triangles in the first stage.
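One common way to make a single atomicMin keep the closest fragment is to pack depth and a triangle identifier into one integer key, so that the smallest key is also the nearest fragment. The Python sketch below emulates that idea serially. The 64-bit layout (depth in the high 32 bits, triangle id in the low 32 bits) and the helper names `pack`, `unpack`, and `atomic_min` are assumptions. The source only states that atomicMin stores the closest fragments and does not describe the buffer format.

```python
import struct

# Cleared-pixel value: larger than any key produced by pack().
EMPTY = 0xFFFFFFFFFFFFFFFF


def pack(depth, tri_id):
    """Pack a non-negative float32 depth and a 32-bit triangle id into one key.

    For non-negative IEEE 754 floats, the raw bit pattern read as an unsigned
    integer preserves ordering, so comparing keys compares depth first.
    """
    if depth < 0.0:
        raise ValueError("depth must be non-negative for bitwise ordering")
    depth_bits = struct.unpack("<I", struct.pack("<f", depth))[0]
    return (depth_bits << 32) | (tri_id & 0xFFFFFFFF)


def unpack(key):
    """Recover (depth, triangle id) from a packed key."""
    depth = struct.unpack("<f", struct.pack("<I", key >> 32))[0]
    return depth, key & 0xFFFFFFFF


def atomic_min(buffer, index, key):
    """Serial stand-in for an atomic min on a 64-bit unsigned value."""
    old = buffer[index]
    if key < old:
        buffer[index] = key
    return old


# Three fragments land on pixel 2; the nearest one (depth 0.25) survives.
framebuffer = [EMPTY] * 4
for depth, tri_id in [(0.75, 7), (0.25, 3), (0.5, 9)]:
    atomic_min(framebuffer, 2, pack(depth, tri_id))
```

After the loop, `unpack(framebuffer[2])` yields the depth and id of the closest fragment, and a later pass could use that id to shade the pixel.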
If this is right
- Dense opaque meshes with hundreds of millions of triangles can be rendered up to 2-5x faster with this compute approach than with Vulkan when the geometry is unique.
- Instanced geometry receives the largest speedups, up to 12x.
- No acceleration structure construction step is required even at hundreds of millions of triangles.
- Vulkan remains about an order of magnitude faster for low-poly meshes, and scenes with thousands of them are not handled efficiently.
Where Pith is reading between the lines
- Porting the size-based staging logic to other compute APIs could extend the speedups to more hardware platforms.
- Adding a separate transparency stage would preserve the opaque path speed while widening applicability.
- Pairing the method with clustered LOD generation could push the practical limit toward billions of triangles.
Load-bearing premise
The input consists of dense, opaque meshes from photogrammetry or reconstruction without transparency, blending, or thousands of separate low-poly objects.
What would settle it
A side-by-side frame-time measurement on the same GPU for a 200-million-triangle photogrammetry model rendered once with CuRast and once with Vulkan.
Original abstract
Previous work shows that small triangles can be rasterized efficiently with compute shaders. Building on this insight, we explore how far this can be pushed for massive triangle datasets without the need to construct acceleration structures in advance.
Method: A 3-stage rasterization pipeline first rasterizes small triangles directly in stage 1, using atomicMin to store the closest fragments. Larger triangles are forwarded to stages 2 and 3.
Results: CuRast can render models with hundreds of millions of triangles up to 2-5x (unique) or up to 12x (instanced) faster than Vulkan. Vulkan remains an order of magnitude faster for low-poly meshes.
Limitations: We currently focus on dense, opaque meshes that you would typically obtain from photogrammetry/3D reconstruction. Blending/Transparency is not yet supported, and scenes with thousands of low-poly meshes are not implemented efficiently.
Future Work: To make it suitable for games and a wider range of use cases, future work will need to (1) optimize handling of scenes with tens of thousands of nodes/meshes, (2) add support for hierarchical clustered LODs such as those produced by Meshoptimizer, (3) add support for transparency, likely in its own stage so as to keep opaque rasterization untouched and fast.
Source Code: https://github.com/m-schuetz/CuRast
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents CuRast, a CUDA-based software rasterizer for massive triangle datasets (hundreds of millions to billions of triangles) that avoids pre-built acceleration structures. It uses a 3-stage pipeline: stage 1 directly rasterizes small triangles with atomicMin depth testing; larger triangles are forwarded to stages 2 and 3. Empirical results claim 2-5x speedups (unique meshes) or up to 12x (instanced) over Vulkan for dense models, while Vulkan is faster for low-poly scenes. The work targets dense opaque meshes from photogrammetry/3D reconstruction, with explicit limitations on transparency, blending, and scenes with thousands of low-poly meshes. Open source code is provided at https://github.com/m-schuetz/CuRast.
Significance. If the performance claims hold under rigorous verification, this could meaningfully advance real-time or near-real-time rendering of very large-scale 3D models in photogrammetry, cultural heritage, and scientific visualization, extending prior compute-shader rasterization techniques without relying on hardware rasterizers or BVH construction. The explicit scoping to dense opaque meshes and the release of reproducible source code are strengths that support direct validation of the reported speedups.
major comments (1)
- Abstract / Results: The performance claims (2-5x unique, up to 12x instanced speedups over Vulkan for hundreds of millions of triangles) are stated without accompanying details on benchmark hardware, exact scene parameters (triangle counts, density, instancing factors), measurement methodology, error bars, or data exclusion rules. This makes it difficult to assess the robustness and generalizability of the central empirical claims.
minor comments (2)
- Limitations paragraph: The phrasing 'dense, opaque meshes that you would typically obtain from photogrammetry/3D reconstruction' uses informal second-person language; rephrasing to 'meshes typically obtained from...' would improve academic tone.
- Future Work: The listed items (handling tens of thousands of nodes, hierarchical clustered LODs, transparency) are clearly scoped but could benefit from a brief note on how each would integrate with the existing 3-stage pipeline without degrading the opaque path.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and recommendation for minor revision. We address the single major comment point-by-point below.
- Referee: Abstract / Results: The performance claims (2-5x unique, up to 12x instanced speedups over Vulkan for hundreds of millions of triangles) are stated without accompanying details on benchmark hardware, exact scene parameters (triangle counts, density, instancing factors), measurement methodology, error bars, or data exclusion rules. This makes it difficult to assess the robustness and generalizability of the central empirical claims.
  Authors: We agree that the abstract and results summary would benefit from additional experimental details to allow readers to better evaluate the claims. In the revised manuscript we will add a concise 'Experimental Setup' paragraph (and corresponding references from the abstract) that specifies: the exact GPU/CPU hardware and driver versions used for all timings; the precise triangle counts, mesh densities, and instancing factors for each reported scene; the timing methodology (CUDA events, warm-up iterations, number of measurement runs); and any data-exclusion criteria applied. Although our measurements exhibited low run-to-run variance, we will also report standard deviations for the key speedup figures. These additions will be kept brief so as not to lengthen the abstract unduly while still providing the requested context.
  revision: yes
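The reporting the rebuttal promises (warm-up iterations, multiple runs, standard deviations, speedup figures) amounts to a small amount of arithmetic, sketched below in Python. The function names `summarize` and `speedup`, the default warm-up count, and the sample frame times are all hypothetical. The numbers are synthetic placeholders chosen to make the arithmetic easy to follow and are not measurements from the paper.

```python
import statistics


def summarize(frame_times_ms, warmup=10):
    """Discard warm-up frames, then return (mean, sample standard deviation)."""
    measured = frame_times_ms[warmup:]
    if len(measured) < 2:
        raise ValueError("need at least two measured frames after warm-up")
    return statistics.mean(measured), statistics.stdev(measured)


def speedup(baseline_ms, candidate_ms, warmup=10):
    """Ratio of mean baseline frame time to mean candidate frame time."""
    base_mean, _ = summarize(baseline_ms, warmup)
    cand_mean, _ = summarize(candidate_ms, warmup)
    return base_mean / cand_mean


# Synthetic placeholder timings in milliseconds; the first two frames of
# each list stand in for warm-up iterations and are discarded.
baseline = [50.0, 50.0, 20.0, 22.0, 18.0]
candidate = [30.0, 30.0, 5.0, 4.0, 6.0]

base_mean, base_std = summarize(baseline, warmup=2)
ratio = speedup(baseline, candidate, warmup=2)
```

Reporting the standard deviation next to each mean, as the authors propose, lets a reader judge whether a stated speedup exceeds run-to-run noise.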
Circularity Check
No significant circularity identified
The paper describes an empirical implementation of a 3-stage CUDA software rasterization pipeline for dense opaque meshes, with performance results obtained via direct benchmarking against Vulkan on specific hardware. No mathematical derivations, fitted parameters, predictions, or uniqueness theorems appear in the argument; the method is presented as an engineering exploration building on prior compute-shader insights without reducing any claim to its own inputs by construction. Source code is provided for independent verification, and limitations are explicitly scoped, confirming the work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: GPU atomicMin operations correctly resolve per-pixel depth for fragments in compute shaders
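The axiom rests on a simple property: taking a minimum is commutative and associative, so the value left in each pixel does not depend on the order in which fragments arrive. The Python sketch below checks this on a small example by shuffling the fragment order. It is a serial emulation with hypothetical helper names (`resolve_depth`, `order_independent`) and integer depth keys, and it does not model GPU scheduling or memory behavior.

```python
import random


def resolve_depth(fragments, num_pixels, empty=2**64 - 1):
    """Keep the smallest key per pixel; fragments are (pixel_index, key) pairs."""
    buffer = [empty] * num_pixels
    for pixel, key in fragments:
        buffer[pixel] = min(buffer[pixel], key)
    return buffer


def order_independent(fragments, num_pixels, trials=20, seed=0):
    """Return True if every shuffled submission order gives the same buffer."""
    rng = random.Random(seed)
    reference = resolve_depth(fragments, num_pixels)
    shuffled = list(fragments)
    for _ in range(trials):
        rng.shuffle(shuffled)
        if resolve_depth(shuffled, num_pixels) != reference:
            return False
    return True


# Three fragments hit pixel 0, two hit pixel 1, and pixel 2 stays empty.
fragments = [(0, 30), (0, 10), (1, 5), (0, 20), (1, 7)]
result = resolve_depth(fragments, num_pixels=3)
```

The check covers only the mathematical property. Whether a given GPU and driver honor it for concurrent writes is what the axiom assumes.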
Forward citations
Cited by 1 Pith paper
- Geometrically Approximated Modeling for Emitter-Centric Ray-Triangle Filtering in Arbitrarily Dynamic LiDAR Simulation
  GRCA uses emitter-centric geometric culling of rays per triangle to accelerate LiDAR simulation in arbitrarily dynamic scenes, reporting up to 14.55x speedup over Embree and 7.97x over OptiX.