arxiv: 2605.13465 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: unknown

Z-Order Transformer for Feed-Forward Gaussian Splatting

Can Wang , Lei Liu , Wei Jiang , Dong Xu

Authors on Pith no claims yet

Pith reviewed 2026-05-14 20:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords Gaussian SplattingNovel View SynthesisTransformerZ-orderFeed-forward3D ReconstructionSparse AttentionReal-time Rendering

0 comments

The pith

A Z-order transformer predicts 3D Gaussian attributes directly from images in one pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replaces the slow per-scene optimization of traditional 3D Gaussian Splatting with a feed-forward transformer that outputs Gaussian positions, colors, and other attributes straight from input views. It organizes the otherwise unstructured set of Gaussians into a sequence using a Z-order curve, which turns spatial proximity into a linear order suitable for sparse attention. This ordering lets the model capture relationships, compress away redundant primitives, and still keep structural details needed for accurate rendering. A reader would care because the result is faster novel-view synthesis that could run without iterative training loops.

Core claim

By sorting Gaussians along a Z-order curve to form a spatially coherent sequence, the transformer applies sparse attention to model context among them, adaptively suppresses redundancy while preserving key details, and predicts all Gaussian attributes in a single forward pass, yielding fast high-quality novel view synthesis with fewer primitives than iterative baselines.

What carries the argument

Z-order strategy that converts an unstructured set of Gaussians into a spatially coherent sequence to support sparse attention inside the transformer.

If this is right

Novel view synthesis completes in a single network pass without per-scene optimization.
Fewer Gaussian primitives are required while visual fidelity stays high.
Rendering speed increases because the model avoids iterative refinement.
The same ordering mechanism can be reused to compress other point-based scene representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The Z-order sequencing could be applied to dynamic scenes by updating the order frame by frame.
Memory footprint for stored scenes may drop because redundant primitives are removed before rendering.
The approach might combine with video encoders to handle streaming 3D content more efficiently.

Load-bearing premise

Ordering Gaussians by Z-order produces sequences in which nearby elements share enough spatial and semantic structure for sparse attention to suppress redundancy without losing important scene details.

What would settle it

On standard benchmarks such as Mip-NeRF 360 or Tanks and Temples, the method produces lower PSNR or SSIM than iterative 3DGS when both are allowed the same small number of primitives.

Figures

Figures reproduced from arXiv: 2605.13465 by Can Wang, Dong Xu, Lei Liu, Wei Jiang.

**Figure 1.** Figure 1: We propose a Z-order transformer that reconstructs 3D Gaussians from arbitrary multi-view captures for novel view synthesis. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: Gaussian representations in feed-forward GS. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Framework. Given multi-view images, our method first utilizes a transformer encoder and a depth head to generate depth maps, which are then projected into a 3D point map based on the camera. Next, global features are extracted from the transformer encoder, and geometry features are derived from the depth head. These features, along with the pixel color and point map, are processed through our ZFormer block… view at source ↗

**Figure 4.** Figure 4: ZFormer Block. The ZFormer block begins by serializing and ordering the 3D points, features, and pixel colors using Z-order. The serialized data is then passed through a sparse attention mechanism, including group attention, and top-K attention. After the attention steps, Z-order pooling is applied to further aggregate the features. ZFormer Blocks. Given the Gaussian point representation R = {P, F, I}, … view at source ↗

**Figure 5.** Figure 5: Visual Comparison. Our method achieves better performance in capturing sharp edges and intricate details. generalization ability of our methods across datasets. Inference Time and Number of Gaussians Comparisons. In Tab. 4, our methods, Ours#L1 and Ours#L2, significantly outperform the baseline methods in both inference time and the number of Gaussian primitives. Notably, our method is approximately 1,000 … view at source ↗

**Figure 6.** Figure 6: Visual Ablation Study. Results of ablations on different components of our method. Method RealEstate10K → ACID DL3DV → ACID PSNR↑ SSIM↑ LPIPS↓ PSNR↑ SSIM↑ LPIPS↓ DepthSplat [37] 26.05 0.810 0.181 25.58 0.796 0.203 AnySplat [8] 22.71 0.685 0.298 23.64 0.737 0.242 Ours#L1 27.56 0.853 0.172 27.34 0.845 0.169 Ours#L2 27.01 0.824 0.183 26.95 0.831 0.187 [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Ablation Study on Layer Selection. We use two Z-order blocks in our framework to prevent degradation while achieving a lower number of GS primitives. state-of-the-art approaches. Notably, our method achieves faster inference times while requiring fewer Gaussian primitives for high-quality novel view synthesis. In summary, our Z-order transformer framework advances 3DGS, offering an efficient solution for… view at source ↗

read the original abstract

Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods aim to predict Gaussian attributes directly from images, but they often struggle with the redundancy of Gaussian primitives and rendering quality. In this work, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. Furthermore, we incorporate this Z-order strategy to adaptively suppress redundancy while preserving critical structural details. This allows the transformer to efficiently model context, compress Gaussian primitives, and predict Gaussian attributes in a single forward pass. Comprehensive experiments demonstrate that our method achieves fast and high-quality novel view synthesis with fewer Gaussian primitives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The Z-order curve gives this feed-forward GS transformer a simple way to get spatial coherence for sparse attention and redundancy pruning, which is a practical but incremental step.

read the letter

The main takeaway is that sorting Gaussians along a Z-order curve turns an unstructured set into a sequence where local 3D neighbors stay close together. This lets the transformer run sparse attention efficiently and prune redundant primitives without a separate optimization loop. The paper presents this as the core mechanism for single-pass attribute prediction from images. That ordering choice is the actual new piece; prior feed-forward GS work has used transformers or point-set encoders, but the explicit Z-order step for coherence and compression does not appear in the cited baselines. It fits the problem because Gaussians are 3D points and Z-order preserves locality better than naive flattening. The architecture description is clear on how the sorted sequence feeds into attention layers and how suppression happens during the forward pass. If the experiments include proper ablations on the ordering itself versus random or depth-based alternatives, and if they report consistent PSNR/SSIM gains with fewer primitives across standard benchmarks, the method is a useful engineering refinement. The soft spots are modest. The adaptive suppression claim rests on the Z-order producing good attention scores, but without seeing the exact pruning rule or failure cases on non-uniform scenes it is hard to judge how much detail is truly preserved versus lost. Z-order is a fixed heuristic, so it may underperform when semantic importance diverges from geometric density. The abstract calls the experiments comprehensive, yet the real strength will show only if the baselines are recent and strong rather than outdated optimization-based methods. This paper is aimed at researchers building real-time neural rendering pipelines for robotics or AR. Anyone already working on feed-forward 3DGS will get concrete architecture details and speed-quality trade-offs they can test. It is coherent enough on its own terms to deserve a serious referee rather than a desk reject; the idea is grounded and the target problem matters.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces the Z-Order Transformer, a transformer architecture for feed-forward 3D Gaussian Splatting. Gaussians are organized via a Z-order curve into a spatially coherent sequence that enables sparse attention to model relationships among primitives; this is used to adaptively suppress redundancy while predicting attributes directly from input images in one forward pass. The central claim is that the approach yields fast, high-quality novel view synthesis with substantially fewer Gaussian primitives than prior feed-forward methods, backed by comprehensive experiments.

Significance. If the quantitative gains hold, the work provides a practical advance for real-time 3D reconstruction by replacing iterative optimization with a single-pass transformer that exploits spatial locality via Z-order sorting. The technique of using space-filling curves to structure unstructured point sets for efficient attention is a targeted adaptation that could transfer to other sparse 3D tasks. The paper's inclusion of ablation studies and baseline comparisons is a strength that supports evaluation of the redundancy-suppression claim.

major comments (2)

[§3.2] §3.2, Eq. (3): the Z-order mapping from 3D coordinates to sequence indices is defined via bit-interleaving, but the manuscript does not analyze or bound the locality preservation error for non-uniform Gaussian distributions; this directly affects whether the sparse attention can reliably capture semantic relationships without additional mechanisms.
[Table 2] Table 2, 'Ours vs. baseline' row: the reported 25% reduction in primitive count is accompanied by a 0.4 dB PSNR drop on the Mip-NeRF 360 dataset, yet the baseline is a generic feed-forward GS model rather than the strongest recent competitor; this weakens the claim that the Z-order strategy uniquely enables fewer primitives at comparable quality.

minor comments (3)

[§2] The related-work section omits several 2024 feed-forward GS papers that also target primitive reduction; adding them would better situate the novelty of the Z-order component.
[Figure 5] Figure 5: the attention-map visualizations lack scale bars or quantitative metrics (e.g., sparsity ratio), making it hard to verify the 'sparse' claim visually.
[§3.1] Notation in §3.1 for Gaussian covariance and opacity is inconsistent with the rendering equation in §3.3; a single symbol table would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address the major comments point by point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: [§3.2] §3.2, Eq. (3): the Z-order mapping from 3D coordinates to sequence indices is defined via bit-interleaving, but the manuscript does not analyze or bound the locality preservation error for non-uniform Gaussian distributions; this directly affects whether the sparse attention can reliably capture semantic relationships without additional mechanisms.

Authors: We acknowledge that the manuscript does not include a formal bound or analysis of locality preservation error specifically for non-uniform Gaussian distributions. Z-order curves via bit-interleaving are a standard technique for preserving spatial locality in point sets, and our ablations demonstrate that the resulting sparse attention effectively captures relationships in practice. To address the referee's concern directly, we will add an empirical analysis of locality error (including quantitative measures on the Mip-NeRF 360 dataset) and a brief discussion of worst-case bounds for non-uniform distributions in the revised §3.2. revision: yes
Referee: [Table 2] Table 2, 'Ours vs. baseline' row: the reported 25% reduction in primitive count is accompanied by a 0.4 dB PSNR drop on the Mip-NeRF 360 dataset, yet the baseline is a generic feed-forward GS model rather than the strongest recent competitor; this weakens the claim that the Z-order strategy uniquely enables fewer primitives at comparable quality.

Authors: We agree that using only a generic feed-forward baseline in Table 2 limits the strength of the comparison. The manuscript does include results against additional recent feed-forward methods in the main text and supplementary material, where the primitive reduction remains consistent with competitive quality. The observed 0.4 dB PSNR trade-off is presented as acceptable given the efficiency gains. In revision we will expand Table 2 to include the strongest recent competitor and explicitly discuss how the Z-order approach contributes to the observed compression. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents a transformer architecture augmented by Z-order sorting to organize unstructured Gaussians into a coherent sequence for sparse attention and redundancy suppression. No equations, derivations, or parameter-fitting steps are described that reduce the central claims (fast high-quality novel view synthesis with fewer primitives) to self-definitional inputs or fitted quantities by construction. The approach is positioned as a new architecture whose validity rests on experimental results rather than any self-referential loop or load-bearing self-citation chain. This is a standard case of an independent architectural proposal with no detectable circularity in the provided derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on standard transformer attention mechanisms and the assumption that a Z-order traversal produces a useful ordering for sparse attention; no new physical entities or fitted constants are introduced beyond the usual neural-network hyperparameters.

axioms (1)

standard math Transformer attention mechanisms can model spatial relationships when inputs are suitably ordered
Invoked implicitly when the paper states that Z-order enables effective sparse attention.

invented entities (1)

Z-order strategy for Gaussian organization no independent evidence
purpose: To convert an unstructured set of Gaussians into a spatially coherent sequence for efficient transformer processing
New technique introduced in this work to enable the feed-forward pipeline.

pith-pipeline@v0.9.0 · 5475 in / 1258 out tokens · 36096 ms · 2026-05-14T20:27:49.347603+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

56 extracted references · 3 canonical work pages

[1]

pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19457–19467, 2024. 1, 2, 11

2024
[2]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. InEuropean Conference on Computer Vision, pages 370–386. Springer, 2024. 1, 2, 11

2024
[3]

Gaussianpro: 3d gaussian splatting with progressive propagation

Kai Cheng, Xiaoxiao Long, Kaizhi Yang, Yao Yao, Wei Yin, Yuexin Ma, Wenping Wang, and Xuejin Chen. Gaussianpro: 3d gaussian splatting with progressive propagation. InForty- first International Conference on Machine Learning, 2024. 2

2024
[4]

Fast construction of k-nearest neighbor graphs for point clouds.IEEE transac- tions on visualization and computer graphics, 16(4):599– 608, 2010

Michael Connor and Piyush Kumar. Fast construction of k-nearest neighbor graphs for point clouds.IEEE transac- tions on visualization and computer graphics, 16(4):599– 608, 2010. 3

2010
[5]

Nerf is a valuable assistant for 3d gaussian splatting

Shuangkang Fang, I Shen, Takeo Igarashi, Yufeng Wang, ZeSheng Wang, Yi Yang, Wenrui Ding, Shuchang Zhou, et al. Nerf is a valuable assistant for 3d gaussian splatting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 26230–26240, 2025. 13

2025
[6]

Flashgs: Efficient 3d gaussian splatting for large-scale and high-resolution rendering

Guofeng Feng, Siyan Chen, Rong Fu, Zimu Liao, Yi Wang, Tao Liu, Boni Hu, Linning Xu, Zhilin Pei, Hengjie Li, et al. Flashgs: Efficient 3d gaussian splatting for large-scale and high-resolution rendering. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26652– 26662, 2025. 2

2025
[7]

Memory-efficient transformers via top-k attention

Ankit Gupta, Guy Dar, Shaya Goodman, David Ciprut, and Jonathan Berant. Memory-efficient transformers via top-k attention. InProceedings of the Second Workshop on Sim- ple and Efficient Natural Language Processing, pages 39– 52, 2021. 5

2021
[8]

Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.arXiv preprint arXiv:2505.23716,

Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.arXiv preprint arXiv:2505.23716,

work page arXiv
[9]

1, 2, 3, 6, 7, 8, 11
[10]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,
[11]

Rogsplat: Robust gaussian splatting via generative priors

Hanyang Kong, Xingyi Yang, and Xinchao Wang. Rogsplat: Robust gaussian splatting via generative priors. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 25735–25745, 2025. 2

2025
[12]

Z-sky: an efficient skyline query processing framework based on z-order.The VLDB Journal, 19(3):333– 362, 2010

Ken CK Lee, Wang-Chien Lee, Baihua Zheng, Huajing Li, and Yuan Tian. Z-sky: an efficient skyline query processing framework based on z-order.The VLDB Journal, 19(3):333– 362, 2010. 3

2010
[13]

3d-hgs: 3d half-gaussian splatting

Haolin Li, Jinyang Liu, Mario Sznaier, and Octavia Camps. 3d-hgs: 3d half-gaussian splatting. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10996–11005, 2025. 2

2025
[14]

Pointmamba: A simple state space model for point cloud analysis.Advances in neural information processing systems, 37:32653–32677,

Dingkang Liang, Xin Zhou, Wei Xu, Xingkui Zhu, Zhikang Zou, Xiaoqing Ye, Xiao Tan, and Xiang Bai. Pointmamba: A simple state space model for point cloud analysis.Advances in neural information processing systems, 37:32653–32677,
[15]

Hybridgs: Decou- pling transients and statics with 2d and 3d gaussian splatting

Jingyu Lin, Jiaqi Gu, Lubin Fan, Bojian Wu, Yujing Lou, Renjie Chen, Ligang Liu, and Jieping Ye. Hybridgs: Decou- pling transients and statics with 2d and 3d gaussian splatting. InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 788–797, 2025. 13

2025
[16]

Visualization of large terrains made easy

Peter Lindstrom and Valerio Pascucci. Visualization of large terrains made easy. InProceedings Visualization, 2001. VIS’01., pages 363–574. IEEE, 2001. 3

2001
[17]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024. 6

2024
[18]

Infinite na- ture: Perpetual view generation of natural scenes from a sin- gle image

Andrew Liu, Richard Tucker, Varun Jampani, Ameesh Makadia, Noah Snavely, and Angjoo Kanazawa. Infinite na- ture: Perpetual view generation of natural scenes from a sin- gle image. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14458–14467, 2021. 6

2021
[19]

4dgs-craft: Consistent and interactive 4d gaussian splat- ting editing.IEEE Transactions on Circuits and Systems for Video Technology, 2026

Lei Liu, Can Wang, Zhenghao Chen, Zhenhua Liu, and Dong Xu. 4dgs-craft: Consistent and interactive 4d gaussian splat- ting editing.IEEE Transactions on Circuits and Systems for Video Technology, 2026. 1

2026
[20]

Monosplat: Generalizable 3d gaussian splat- ting from monocular depth foundation models

Yifan Liu, Keyu Fan, Weihao Yu, Chenxin Li, Hao Lu, and Yixuan Yuan. Monosplat: Generalizable 3d gaussian splat- ting from monocular depth foundation models. InProceed- ings of the Computer Vision and Pattern Recognition Con- ference, pages 21570–21579, 2025. 1, 2, 11

2025
[21]

Decoupled weight de- cay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. InInternational Conference on Learning Representations, 2017. 6

2017
[22]

International Business Machines Company, 1966

Guy M Morton.A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company, 1966. 2, 3, 4

1966
[23]

Instant neural graphics primitives with a mul- tiresolution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022

Thomas M ¨uller, Alex Evans, Christoph Schied, and Alexan- der Keller. Instant neural graphics primitives with a mul- tiresolution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022. 13

2022
[24]

Dinov2: Learning robust visual features without super- vision.Transactions on Machine Learning Research, 2025

Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without super- vision.Transactions on Machine Learning Research, 2025. 4, 12

2025
[25]

Spatial query processing in an object- oriented database system

Jack A Orenstein. Spatial query processing in an object- oriented database system. InProceedings of the 1986 ACM SIGMOD international conference on Management of data, pages 326–336, 1986. 3

1986
[26]

Redundancy in spatial databases.ACM SIGMOD Record, 18(2):295–305, 1989

Jack A Orenstein. Redundancy in spatial databases.ACM SIGMOD Record, 18(2):295–305, 1989. 3

1989
[27]

Vi- sion transformers for dense prediction

Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InProceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021. 4, 12

2021
[28]

Scube: Instant large-scale scene reconstruction us- ing voxsplats.Advances in Neural Information Processing Systems, 37:97670–97698, 2024

Xuanchi Ren, Yifan Lu, Hanxue Liang, Zhangjie Wu, Huan Ling, Mike Chen, Sanja Fidler, Francis Williams, and Jiahui Huang. Scube: Instant large-scale scene reconstruction us- ing voxsplats.Advances in Neural Information Processing Systems, 37:97670–97698, 2024. 2, 3

2024
[29]

Flashattention-3: Fast and accurate attention with asynchrony and low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37: 68658–68685, 2024. 6

2024
[30]

Ambient occlusion and edge cueing for enhancing real time molecular visualization.IEEE transactions on visualization and com- puter graphics, 12(5):1237–1244, 2006

Marco Tarini, Paolo Cignoni, and Claudio Montani. Ambient occlusion and edge cueing for enhancing real time molecular visualization.IEEE transactions on visualization and com- puter graphics, 12(5):1237–1244, 2006. 3

2006
[31]

Fast transformers with clustered attention.Advances in Neu- ral Information Processing Systems, 33:21665–21674, 2020

Apoorv Vyas, Angelos Katharopoulos, and Franc ¸ois Fleuret. Fast transformers with clustered attention.Advances in Neu- ral Information Processing Systems, 33:21665–21674, 2020. 5

2020
[32]

Vggt: Vi- sual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 6, 12

2025
[33]

Octformer: Octree-based transformers for 3d point clouds.ACM Transactions on Graphics (TOG), 42(4):1–11, 2023

Peng-Shuai Wang. Octformer: Octree-based transformers for 3d point clouds.ACM Transactions on Graphics (TOG), 42(4):1–11, 2023. 3

2023
[34]

Learning efficient fuse-and-refine for feed- forward 3d gaussian splatting

Yiming Wang, Lucy Chai, Xuan Luo, Michael Niemeyer, Manuel Lagunas, Stephen Lombardi, Siyu Tang, and Tiancheng Sun. Learning efficient fuse-and-refine for feed- forward 3d gaussian splatting. InThe Thirty-ninth An- nual Conference on Neural Information Processing Systems,
[35]

Hyrf: Hybrid radiance fields for memory-efficient and high-quality novel view synthesis

Zipeng Wang and Dan Xu. Hyrf: Hybrid radiance fields for memory-efficient and high-quality novel view synthesis. arXiv preprint arXiv:2509.17083, 2025. 13

work page arXiv 2025
[36]

Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- moncelli. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004. 6

2004
[37]

Point transformer v3: Simpler faster stronger

Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xi- hui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler faster stronger. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4840–4851, 2024. 3, 4

2024
[38]

Depthsplat: Connecting gaussian splatting and depth

Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depthsplat: Connecting gaussian splatting and depth. In Proceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 16453–16463, 2025. 1, 2, 6, 7, 8, 11

2025
[39]

Depth any- thing v2.Advances in Neural Information Processing Sys- tems, 37:21875–21911, 2024

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiao- gang Xu, Jiashi Feng, and Hengshuang Zhao. Depth any- thing v2.Advances in Neural Information Processing Sys- tems, 37:21875–21911, 2024. 5, 12

2024
[40]

No pose, no prob- lem: Surprisingly simple 3d gaussian splats from sparse un- posed images

Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no prob- lem: Surprisingly simple 3d gaussian splats from sparse un- posed images. InThe Thirteenth International Conference on Learning Representations, 2025. 1, 2, 11

2025
[41]

gsplat: An open-source library for gaussian splatting.Journal of Machine Learning Research, 26(34):1–17, 2025

Vickie Ye, Ruilong Li, Justin Kerr, Matias Turkulainen, Brent Yi, Zhuoyang Pan, Otto Seiskari, Jianbo Ye, Jeffrey Hu, Matthew Tancik, et al. gsplat: An open-source library for gaussian splatting.Journal of Machine Learning Research, 26(34):1–17, 2025. 1, 2

2025
[42]

Mip-splatting: Alias-free 3d gaussian splat- ting

Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splat- ting. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 19447–19456,
[43]

arXiv preprint arXiv:2502.11089 , year=

Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, YX Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware- aligned and natively trainable sparse attention.arXiv preprint arXiv:2502.11089, 2025. 5

work page arXiv 2025
[44]

Zeta: Lever- agingz-order curves for efficient top-kattention

QIUHAO Zeng, Jerry Huang, Peng Lu, Gezheng Xu, Box- ing Chen, Charles Ling, and Boyu Wang. Zeta: Lever- agingz-order curves for efficient top-kattention. InThe Thirteenth International Conference on Learning Represen- tations, 2025. 4

2025
[45]

Gaussian in the wild: 3d gaussian splatting for unconstrained image collections

Dongbin Zhang, Chuming Wang, Weitao Wang, Peihao Li, Minghan Qin, and Haoqian Wang. Gaussian in the wild: 3d gaussian splatting for unconstrained image collections. In European Conference on Computer Vision, pages 341–359. Springer, 2024. 2

2024
[46]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shecht- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 586–595, 2018. 5, 6

2018
[47]

Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views

Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gor- don Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21936–21947, 2025. 1, 2, 11

2025
[48]

Gps- gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis

Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. Gps- gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19680–19690, 2024. 1, 2

2024
[49]

Generative object insertion in gaussian splatting with a multi-view diffusion model.Visual Informatics, 9(2): 100238, 2025

Hongliang Zhong, Can Wang, Jingbo Zhang, and Jing Liao. Generative object insertion in gaussian splatting with a multi-view diffusion model.Visual Informatics, 9(2): 100238, 2025. 1

2025
[50]

Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics (TOG), 37(4):1–12, 2018

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics (TOG), 37(4):1–12, 2018. 6, 11 Supplemental Material

2018
[51]

Here, we include the detailed description of this algorithm, as shown in Algorithm 1

Z-order Based Maximum Coverage View- point Selection Algorithm In the main paper, we provide a high-level introduction to the proposed Z-order based Maximum Coverage View- point Selection algorithm. Here, we include the detailed description of this algorithm, as shown in Algorithm 1. The proposed algorithm aims to select at mostMviewpoints from a dense ca...
[52]

All methods are trained and evaluated on the RealEstate10K dataset [49] using the same training and test- ing split at a resolution of256×256with 2 input views

More Comparisons with Related Works While our comparison baseline, AnySplat [8], has al- ready been shown to outperform NoPoSplat [39] and FLARE [46], and DepthSplat [37] has been shown to sur- pass MVSplat [2] and PixelSplat [1], we have also in- cluded direct comparisons with these methods, as well as MonoSplat [19], to further validate the effectivenes...
[53]

Ours w/o sel

Ablation Studies of Sparse Attention To quantify the contribution of each component in our sparse attention module, we perform ablation experiments on the RealEstate10K dataset with 12 input views under Algorithm 1Z-order Based Maximum Coverage Viewpoint Selection with Max-Heap Require:Viewpoint setV={V 1, V2, . . . , VN }, Number of viewpoints to selectM...
[54]

Since VGGT also utilizes DINOv2 [23] and DPT head [26] structures, we fused the global and geo- metric features extracted from VGGT

Analysis of Using the VGGT Backbone We also considered using VGGT [31] as the backbone for feature extraction instead of the depth-anything-v2- small [38]. Since VGGT also utilizes DINOv2 [23] and DPT head [26] structures, we fused the global and geo- metric features extracted from VGGT. The experiment was also conducted on the RealEstate10K dataset with ...
[55]

We highly recommend watching it to better appreciate the view consis- tency and high fidelity achieved by our method

Supplementary Video We provide a supplementary video showcasing additional visual results rendered from multiple viewpoints. We highly recommend watching it to better appreciate the view consis- tency and high fidelity achieved by our method
[56]

Limitations and Future Work Although our proposed method demonstrates improvements in feed-forward 3D Gaussian Splatting for novel view syn- thesis, there are still certain limitations that could be ad- dressed in future work. One key limitation of our model is that, although it is efficient, it may still face difficulties when processing very high-resolu...