pith. machine review for the scientific record.

arxiv: 2604.05182 · v1 · submitted 2026-04-06 · 💻 cs.CV · cs.AI

Recognition: 2 Lean theorem links

LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:25 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: 3D reconstruction · object-centric · sparse attention · context window · novel view synthesis · inverse rendering · feed-forward · transformer

The pith

Scaling context windows in sparse transformers lets feed-forward 3D models recover fine object textures and appearances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LSRM to examine whether substantially larger context windows in transformer models can improve feed-forward 3D reconstruction of individual objects. It shows that handling many more object and image tokens through efficient sparse mechanisms narrows the quality gap with slower per-scene optimization methods. If that holds, detailed 3D modeling from sparse images becomes far more practical and scalable. The work reports concrete gains of 2.5 dB higher PSNR and 40 percent lower LPIPS on standard benchmarks, plus improved results when the same model is applied to inverse rendering.

Core claim

By increasing active object tokens twentyfold and image tokens more than twofold, LSRM achieves high-fidelity 3D object reconstruction and inverse rendering in a single feed-forward pass, matching or exceeding dense-view optimization on texture and geometry detail.

What carries the argument

The 3D-aware spatial routing mechanism that uses explicit geometric distances to establish 2D-3D correspondences, paired with a coarse-to-fine pipeline that predicts sparse high-resolution residuals.
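
As a rough illustration of what distance-based routing might look like (the paper's actual NSA-based implementation is not reproduced here; token shapes, the top-k budget, and the unprojection step are assumptions), the sketch below selects, for each sparse volume token, the image tokens whose estimated 3D locations lie closest to that voxel's center, instead of ranking candidates by attention score.

```python
# Minimal sketch of 3D-aware spatial routing (not the paper's implementation).
# Assumption: each image token already carries an estimated 3D location, e.g.
# the unprojected center of its 8x8 patch, and each volume token carries its
# voxel-center position. We route each voxel token to its k nearest image tokens.
import numpy as np

def route_by_distance(voxel_centers, image_token_positions, k=16):
    """Return, for every voxel token, indices of the k closest image tokens.

    voxel_centers:          (V, 3) array of voxel-center positions.
    image_token_positions:  (T, 3) array of estimated 3D locations of image tokens.
    """
    # Pairwise squared Euclidean distances, shape (V, T).
    diff = voxel_centers[:, None, :] - image_token_positions[None, :, :]
    dist2 = np.einsum("vtc,vtc->vt", diff, diff)
    # Indices of the k smallest distances per voxel (unordered within the top-k).
    return np.argpartition(dist2, kth=k - 1, axis=1)[:, :k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    voxels = rng.uniform(-1, 1, size=(64, 3))       # toy sparse volume tokens
    img_tokens = rng.uniform(-1, 1, size=(512, 3))  # toy image tokens
    selected = route_by_distance(voxels, img_tokens, k=16)
    print(selected.shape)  # (64, 16): image tokens each voxel would attend to
```

Cross-attention would then be computed only between each voxel and its selected image tokens, which is the kind of stable 2D-3D correspondence Figure 3 contrasts with attention-score selection.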

If this is right

  • The model processes 20 times as many object tokens, and more than twice as many image tokens, as earlier state-of-the-art feed-forward methods.
  • It records 2.5 dB higher PSNR and 40 percent lower LPIPS on novel-view synthesis benchmarks.
  • Texture and geometry details improve consistently when the same architecture is applied to inverse rendering.
  • LPIPS scores on inverse rendering match or exceed those of current dense-view optimization methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results imply that limited context size, not the transformer design itself, was the main obstacle to high-fidelity feed-forward reconstruction.
  • The same token-scaling approach with geometric routing could be tested on full scenes or dynamic objects to see whether the gains transfer.
  • Explicit 3D distance cues in attention may reduce reliance on learned positional encodings for other geometry-aware tasks.
  • Further context growth might support even higher resolution outputs or more objects per reconstruction without additional per-scene optimization.

Load-bearing premise

That routing attention by geometric distance and adding only sparse high-resolution residuals will preserve every important texture and appearance cue on complex real objects, without dropping detail or introducing new errors.

What would settle it

A controlled test on the same benchmarks where further increasing context size or altering the geometric routing produces no PSNR or LPIPS gain, or where visual inspection of outputs reveals missing fine details on real-world objects.
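
A minimal sketch of how such a settling test could be scored, assuming rendered and ground-truth image pairs are already available for each configuration (configuration names and noise levels below are hypothetical; LPIPS would come from a learned perceptual metric such as the public lpips package, applied to the same image pairs):

```python
# Sketch of the "what would settle it" comparison: score two configurations
# (e.g. baseline context vs. enlarged context) on the same held-out renders.
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def psnr_gap(renders_a, renders_b, gts):
    """Mean PSNR difference (config B minus config A) over matched test views."""
    psnr_a = np.mean([psnr(p, g) for p, g in zip(renders_a, gts)])
    psnr_b = np.mean([psnr(p, g) for p, g in zip(renders_b, gts)])
    return psnr_b - psnr_a  # ~0 dB would suggest context size was not the bottleneck

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    gts = [rng.uniform(size=(64, 64, 3)) for _ in range(4)]
    cfg_small = [np.clip(g + rng.normal(0, 0.05, g.shape), 0, 1) for g in gts]
    cfg_large = [np.clip(g + rng.normal(0, 0.03, g.shape), 0, 1) for g in gts]
    print(f"PSNR gain from larger context: {psnr_gap(cfg_small, cfg_large, gts):+.2f} dB")
```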

Figures

Figures reproduced from arXiv: 2604.05182 by Cheng Zhang, Jakob Engel, Zhao Dong, Zhengqin Li.

Figure 1: High-fidelity 3D reconstruction. Given 12–18 images (left), LSRM adapts Native Sparse Attention (NSA) to generate explicit meshes and textures in a single feed-forward pass with minimal compute overhead. As an extension, LSRM can also predict BRDF maps for inverse rendering (right). Zoom in for details.

Figure 2: Network architecture of LSRM. Our method employs a two-stage coarse-to-fine pipeline. In Stage 1, a Dense Reconstruction Transformer generates a coarse, low-resolution volume. Stage 2 utilizes this coarse volume to initialize the active sparse volume tokens and predicts high-resolution sparse residuals, constructing the final 3D representation.

Figure 3: Attention-score vs. 3D-aware selection. Standard attention captures 2D-3D correspondence only in deep layers, whereas our 3D-aware routing ensures stable selection to enhance texture.

Figure 4: Token count vs. training time. Distribution of active tokens and their corresponding training times for resolutions S_img = 96 and S_vol = 64. Each point represents a single data instance.

Figure 5: Custom sequence parallelism. Toy example assuming ≤4 tokens per spatial block across 3 GPUs. (a) Tokens are distributed evenly without breaking spatial blocks. (b) Tokens return to their source GPUs for decoding and volume rendering. (c) All-gather-KV ensures global context before every NSACrossAttn layer. (A rough sketch of this block-aware distribution appears after the figure list.)

Figure 6: Qualitative NVS Results on the GSO Dataset.

Figure 7: Qualitative Ablation Study on the GSO Dataset.

Figure 8: Qualitative Inverse Rendering Results. Visual comparisons against optimization-based and feed-forward baselines demonstrate that LSRM consistently recovers higher-fidelity textures and finer geometric details.

Figure 9: Qualitative 3D reconstruction and inverse rendering results with different numbers of input views. Quantitative comparison embedded in the figure:

Metric       LIRM [41], 8 views   LIRM [41], 16 views   LSRM (ours), 8 views   LSRM (ours), 16 views
PSNR (↑)     30.48                30.56                 32.46                  33.08
SSIM (↑)     0.947                0.948                 0.968                  0.971
LPIPS (↓)    0.056                0.054                 0.031                  0.028
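
The custom sequence parallelism in Figure 5 hinges on splitting a variable number of active tokens across GPUs without ever splitting a spatial block. Below is a minimal greedy sketch of that block-aware balancing; the block sizes, GPU count, and largest-first heuristic are illustrative assumptions, and the paper's All-gather-KV protocol itself is not reproduced.

```python
# Toy sketch of block-aware load balancing: assign whole spatial blocks to GPUs
# so token counts stay roughly even, without breaking any block (cf. Figure 5a).
import heapq

def assign_blocks(block_sizes, num_gpus):
    """Greedily place each block (largest first) on the currently least-loaded GPU."""
    loads = [(0, gpu) for gpu in range(num_gpus)]   # (token_count, gpu_id) min-heap
    heapq.heapify(loads)
    assignment = {}
    for block_id, size in sorted(enumerate(block_sizes), key=lambda x: -x[1]):
        tokens, gpu = heapq.heappop(loads)
        assignment[block_id] = gpu
        heapq.heappush(loads, (tokens + size, gpu))
    return assignment, sorted(loads)

if __name__ == "__main__":
    sizes = [4, 3, 2, 4, 1, 3, 2, 4]                 # active tokens per spatial block
    assignment, loads = assign_blocks(sizes, num_gpus=3)
    print(assignment)                                # block -> GPU; no block is split
    print(loads)                                     # per-GPU token totals stay balanced
```

Before each NSACrossAttn layer, keys and values would then be all-gathered so every GPU sees the global context, as in panel (c) of Figure 5; that communication step is outside this sketch.
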
Original abstract

We introduce the Large Sparse Reconstruction Model to study how scaling transformer context windows impacts feed-forward 3D reconstruction. Although recent object-centric feed-forward methods deliver robust, high-quality reconstruction, they still lag behind dense-view optimization in recovering fine-grained texture and appearance. We show that expanding the context window -- by substantially increasing the number of active object and image tokens -- remarkably narrows this gap and enables high-fidelity 3D object reconstruction and inverse rendering. To scale effectively, we adapt native sparse attention in our architecture design, unlocking its capacity for 3D reconstruction with three key contributions: (1) an efficient coarse-to-fine pipeline that focuses computation on informative regions by predicting sparse high-resolution residuals; (2) a 3D-aware spatial routing mechanism that establishes accurate 2D-3D correspondences using explicit geometric distances rather than standard attention scores; and (3) a custom block-aware sequence parallelism strategy utilizing an All-gather-KV protocol to balance dynamic, sparse workloads across GPUs. As a result, LSRM handles 20x more object tokens and >2x more image tokens than prior state-of-the-art (SOTA) methods. Extensive evaluations on standard novel-view synthesis benchmarks show substantial gains over the current SOTA, yielding 2.5 dB higher PSNR and 40% lower LPIPS. Furthermore, when extending LSRM to inverse rendering tasks, qualitative and quantitative evaluations on widely-used benchmarks demonstrate consistent improvements in texture and geometry details, achieving an LPIPS that matches or exceeds that of SOTA dense-view optimization methods. Code and model will be released on our project page.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Large Sparse Reconstruction Model (LSRM) to investigate scaling transformer context windows for feed-forward 3D object reconstruction. It adapts sparse attention via a coarse-to-fine pipeline predicting sparse high-resolution residuals, a 3D-aware spatial routing mechanism using explicit geometric distances for 2D-3D correspondences, and block-aware sequence parallelism with All-gather-KV. The model processes 20x more object tokens and >2x more image tokens than prior SOTA, reporting 2.5 dB PSNR gains and 40% lower LPIPS on novel-view synthesis benchmarks, plus competitive inverse rendering results.

Significance. If the empirical gains hold under full scrutiny, the work demonstrates that context scaling in object-centric feed-forward models can substantially close the fidelity gap to dense-view optimization methods for texture and appearance recovery. The engineering contributions for efficient sparse 3D-aware processing are practically significant for scaling vision transformers. Planned code and model release supports reproducibility.

major comments (3)
  1. [Experiments] Experiments section: The headline 2.5 dB PSNR and 40% LPIPS improvements are presented without explicit details on train/test data splits, baseline re-implementations, hyperparameter search protocols, or full ablation tables isolating context scaling from the routing and residual components. This prevents confirmation that gains are not attributable to post-hoc selection or metric-specific tuning.
  2. [Method] Method, 3D-aware spatial routing subsection: The routing uses deterministic geometric distances derived from coarse-stage estimates rather than learned attention scores. No quantitative analysis (e.g., routing error rates or ablation vs. feature-similarity routing) is provided to verify that this preserves fine-grained appearance details such as specular highlights or view-dependent effects on complex objects; misalignment here would discard residuals before the high-resolution stage can recover them.
  3. [Inverse rendering evaluations] Inverse rendering results: Claims that LPIPS matches or exceeds dense-view SOTA are stated without tabulated quantitative metrics, exact benchmark splits, or controls for the same number of input views as the NVS experiments, making the extension claim difficult to evaluate independently.
minor comments (2)
  1. [Method] Notation for token counts (object vs. image) and sparsity ratios should be defined once in a table and used consistently in text and figures.
  2. [Figures] Figure captions for qualitative results should explicitly note the input view count and highlight specific regions where texture/geometry improvements are visible.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, outlining the revisions we will make to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The headline 2.5 dB PSNR and 40% LPIPS improvements are presented without explicit details on train/test data splits, baseline re-implementations, hyperparameter search protocols, or full ablation tables isolating context scaling from the routing and residual components. This prevents confirmation that gains are not attributable to post-hoc selection or metric-specific tuning.

    Authors: We agree that the current presentation would benefit from greater transparency on these experimental details. In the revised manuscript, we will expand the Experiments section with explicit descriptions of the train/test data splits, the exact protocols used for baseline re-implementations (including hyperparameter search ranges and selection criteria), and comprehensive ablation tables that separately quantify the contributions of context scaling, 3D-aware routing, and residual prediction. These additions will allow independent verification that the reported gains stem from the proposed scaling approach rather than tuning artifacts. revision: yes

  2. Referee: [Method] Method, 3D-aware spatial routing subsection: The routing uses deterministic geometric distances derived from coarse-stage estimates rather than learned attention scores. No quantitative analysis (e.g., routing error rates or ablation vs. feature-similarity routing) is provided to verify that this preserves fine-grained appearance details such as specular highlights or view-dependent effects on complex objects; misalignment here would discard residuals before the high-resolution stage can recover them.

    Authors: The deterministic geometric routing is chosen specifically to enforce accurate 2D-3D correspondences based on explicit 3D structure, which we argue is more reliable than learned attention scores for maintaining geometric fidelity in sparse settings. We acknowledge, however, that empirical validation of this choice is valuable. In the revision, we will add quantitative metrics such as routing error rates on validation data and an ablation comparing geometric routing against feature-similarity routing, with focused analysis on preservation of specular highlights and view-dependent effects. revision: yes

  3. Referee: [Inverse rendering evaluations] Inverse rendering results: Claims that LPIPS matches or exceeds dense-view SOTA are stated without tabulated quantitative metrics, exact benchmark splits, or controls for the same number of input views as the NVS experiments, making the extension claim difficult to evaluate independently.

    Authors: The manuscript reports both qualitative and quantitative results for inverse rendering, but we agree that the quantitative component would be more accessible with explicit tables and controls. In the revised version, we will include detailed tables of LPIPS and other metrics for inverse rendering, specify the exact benchmark splits used, and confirm that input view counts are matched to those in the novel-view synthesis experiments for fair comparison. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; results are empirical benchmark comparisons

full rationale

The paper introduces LSRM as an architectural extension of transformer-based 3D reconstruction that scales context windows via sparse attention, a coarse-to-fine pipeline, geometric routing, and parallelism. Its headline quantitative claims (2.5 dB PSNR, 40% LPIPS reduction) are presented solely as outcomes of training and evaluating the model on standard novel-view synthesis and inverse-rendering benchmarks against prior SOTA. No equations, fitted parameters, or self-referential definitions appear that would reduce these gains to quantities computed from the same test data or prior self-citations. The derivation chain consists of design choices whose validity is tested externally via public benchmarks, rendering the result self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard transformer attention assumptions plus geometric priors for routing; no new physical entities or unstated mathematical axioms are introduced beyond the empirical architecture.

pith-pipeline@v0.9.0 · 5600 in / 1148 out tokens · 34730 ms · 2026-05-10T19:25:53.466470+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

89 extracted references · 31 canonical work pages · 9 internal anchors

  1. [1] Ainslie, J., Lee-Thorp, J., De Jong, M., Zemlyanskiy, Y., Lebrón, F., Sanghai, S.: Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895–4901 (2023)
  2. [2] Bi, S., Xu, Z., Sunkavalli, K., Kriegman, D., Ramamoorthi, R.: Deep 3d capture: Geometry and reflectance from sparse multi-view images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5960–5969 (2020)
  3. [3] Boss, M., Braun, R., Jampani, V., Barron, J.T., Liu, C., Lensch, H.: Nerd: Neural reflectance decomposition from image collections. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12684–12694 (2021)
  4. [4] Boss, M., Engelhardt, A., Kar, A., Li, Y., Sun, D., Barron, J., Lensch, H., Jampani, V.: Samurai: Shape and material from unconstrained real-world arbitrary image collections. Advances in Neural Information Processing Systems 35, 26389–26403 (2022)
  5. [5] Boss, M., Jampani, V., Braun, R., Liu, C., Barron, J., Lensch, H.: Neural-pil: Neural pre-integrated lighting for reflectance decomposition. Advances in Neural Information Processing Systems 34, 10691–10704 (2021)
  6. [6] Careaga, C., Aksoy, Y.: Intrinsic image decomposition via ordinal shading. ACM Transactions on Graphics 43(1), 1–24 (2023)
  7. [7] Chen, A., Xu, H., Esposito, S., Tang, S., Geiger, A.: Lara: Efficient large-baseline radiance fields. In: European Conference on Computer Vision, pp. 338–355. Springer (2024)
  8. [8] Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., et al.: Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624 (2025)
  9. [9] Chen, Z., Tang, J., Dong, Y., Cao, Z., Hong, F., Lan, Y., Wang, T., Xie, H., Wu, T., Saito, S., et al.: 3dtopia-xl: Scaling high-quality 3d asset generation via primitive diffusion. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 26576–26586 (2025)
  10. [10] Child, R., Gray, S., Radford, A., Sutskever, I.: Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 (2019)
  11. [11] Dao, T.: Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691 (2023)
  12. [12] Debevec, P., Hawkins, T., Tchou, C., Duiker, H.P., Sarokin, W., Sagar, M.: Acquiring the reflectance field of a human face. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 145–156 (2000)
  13. [13] Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13142–13153 (2023)
  14. [14] Deschaintre, V., Aittala, M., Durand, F., Drettakis, G., Bousseau, A.: Single-image svbrdf capture with a rendering-aware deep network. ACM Transactions on Graphics (ToG) 37(4), 1–15 (2018)
  15. [15] Dong, Z., Chen, K., Lv, Z., Yu, H.X., Zhang, Y., Zhang, C., Zhu, Y., Tian, S., Li, Z., Moffatt, G., et al.: Digital twin catalog: A large-scale photorealistic 3d object digital twin dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 753–763 (2025)
  16. [16] Downs, L., Francis, A., Koenig, N., Kinman, B., Hickman, R., Reymann, K., McHugh, T.B., Vanhoucke, V.: Google scanned objects: A high-quality dataset of 3d scanned household items. In: 2022 International Conference on Robotics and Automation (ICRA), pp. 2553–2560. IEEE (2022)
  17. [17] Engelhardt, A., Raj, A., Boss, M., Zhang, Y., Kar, A., Li, Y., Sun, D., Brualla, R.M., Barron, J.T., Lensch, H., et al.: Shinobi: Shape and illumination using neural object decomposition via brdf optimization in-the-wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19636–19646 (2024)
  18. [18] Fang, J., Zhao, S.: Usp: A unified sequence parallelism approach for long context generative ai. arXiv preprint arXiv:2405.07719 (2024)
  19. [19] Goldman, D.B., Curless, B., Hertzmann, A., Seitz, S.M.: Shape and spatially-varying brdfs from photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(6), 1060–1071 (2009)
  20. [20] Gupta, A., Xiong, W., Nie, Y., Jones, I., Oğuz, B.: 3dgen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371 (2023)
  21. [21] Hasselgren, J., Hofmann, N., Munkberg, J.: Shape, light, and material decomposition from images using monte carlo rendering and denoising. Advances in Neural Information Processing Systems 35, 22856–22869 (2022)

  22. [22] Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400 (2023)
  23. [23] Hunyuan3D, T., Yang, S., Yang, M., Feng, Y., Huang, X., Zhang, S., He, Z., Luo, D., Liu, H., Zhao, Y., et al.: Hunyuan3d 2.1: From images to high-fidelity 3d assets with production-ready pbr material. arXiv preprint arXiv:2506.15442 (2025)
  24. [24] Jacobs, S.A., Tanaka, M., Zhang, C., Zhang, M., Song, S.L., Rajbhandari, S., He, Y.: Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509 (2023)
  25. [25] Jiang, Y., Tu, J., Liu, Y., Gao, X., Long, X., Wang, W., Ma, Y.: Gaussianshader: 3d gaussian splatting with shading functions for reflective surfaces. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5322–5332 (2024)
  26. [26] Jin, H., Jiang, H., Tan, H., Zhang, K., Bi, S., Zhang, T., Luan, F., Snavely, N., Xu, Z.: Lvsm: A large view synthesis model with minimal 3d inductive bias. arXiv preprint arXiv:2410.17242 (2024)
  27. [27] Jin, H., Liu, I., Xu, P., Zhang, X., Han, S., Bi, S., Zhou, X., Xu, Z., Su, H.: Tensoir: Tensorial inverse rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 165–174 (2023)
  28. [28] Korthikanti, V.A., Casper, J., Lym, S., McAfee, L., Andersch, M., Shoeybi, M., Catanzaro, B.: Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems 5, 341–353 (2023)
  29. [29] Kuang, Z., Zhang, Y., Yu, H.X., Agarwala, S., Wu, E., Wu, J., et al.: Stanford-orb: A real-world 3d object inverse rendering benchmark. Advances in Neural Information Processing Systems 36 (2024)
  30. [30] Lai, X., Lu, J.: native-sparse-attention-triton: Efficient triton implementation of Native Sparse Attention. https://github.com/XunhaoLai/native-sparse-attention-triton (2025)
  31. [31] Lan, Y., Hong, F., Zhou, S., Yang, S., Meng, X., Chen, Y., Lyu, Z., Dai, B., Pan, X., Loy, C.C.: Ln3diff++: Scalable latent neural fields diffusion for speedy 3d generation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
  32. [32] Lan, Y., Zhou, S., Lyu, Z., Hong, F., Yang, S., Dai, B., Pan, X., Loy, C.C.: Gaussiananything: Interactive point cloud latent diffusion for 3d generation (2025)
  33. [33] Leroy, V., Cabon, Y., Revaud, J.: Grounding image matching in 3d with mast3r. In: European Conference on Computer Vision, pp. 71–91. Springer (2024)
  34. [34] Li, W., Liu, J., Yan, H., Chen, R., Liang, Y., Chen, X., Tan, P., Long, X.: Craftsman3d: High-fidelity mesh generation with 3d native generation and interactive geometry refiner. arXiv preprint arXiv:2405.14979 (2024)
  35. [35] Li, W., Zhang, X., Sun, Z., Qi, D., Li, H., Cheng, W., Cai, W., Wu, S., Liu, J., Wang, Z., et al.: Step1x-3d: Towards high-fidelity and controllable generation of textured 3d assets. arXiv preprint arXiv:2505.07747 (2025)
  36. [36] Li, X., Dong, Y., Peers, P., Tong, X.: Modeling surface appearance from a single photograph using self-augmented convolutional neural networks. ACM Transactions on Graphics (ToG) 36(4), 1–11 (2017)
  37. [37] Li, Y., Zou, Z.X., Liu, Z., Wang, D., Liang, Y., Yu, Z., Liu, X., Guo, Y.C., Liang, D., Ouyang, W., et al.: Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
  38. [38] Li, Z., Snavely, N.: Cgintrinsics: Better intrinsic image decomposition through physically-based rendering. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 371–387 (2018)
  39. [39] Li, Z., Shi, J., Bi, S., Zhu, R., Sunkavalli, K., Hašan, M., Xu, Z., Ramamoorthi, R., Chandraker, M.: Physically-based editing of indoor scene lighting from a single image. In: European Conference on Computer Vision, pp. 555–572. Springer (2022)
  40. [40] Li, Z., Sunkavalli, K., Chandraker, M.: Materials for masses: Svbrdf acquisition with a single mobile phone image. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 72–87 (2018)
  41. [41] Li, Z., Wang, D., Chen, K., Lv, Z., Nguyen-Phuoc, T., Lee, M., Huang, J.B., Xiao, L., Zhu, Y., Marshall, C.S., et al.: Lirm: Large inverse rendering model for progressive reconstruction of shape, materials and view-dependent radiance fields. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 505–517 (2025)

  42. [42] Li, Z., Xu, Z., Ramamoorthi, R., Sunkavalli, K., Chandraker, M.: Learning to reconstruct shape and spatially-varying reflectance from a single image. ACM Transactions on Graphics (TOG) 37(6), 1–11 (2018)
  43. [43] Li, Z., Wang, Y., Zheng, H., Luo, Y., Wen, B.: Sparc3d: Sparse representation and construction for high-resolution 3d shapes modeling. arXiv preprint arXiv:2505.14521 (2025)
  44. [44] Liang, H., Ren, J., Mirzaei, A., Torralba, A., Liu, Z., Gilitschenski, I., Fidler, S., Oztireli, C., Ling, H., Gojcic, Z., et al.: Feed-forward bullet-time reconstruction of dynamic scenes from monocular videos. arXiv preprint arXiv:2412.03526 (2024)
  45. [45] Liang, Z., Zhang, Q., Feng, Y., Shan, Y., Jia, K.: Gs-ir: 3d gaussian splatting for inverse rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21644–21653 (2024)
  46. [46] Lin, C.H., Lv, Z., Wu, S., Xu, Z., Nguyen-Phuoc, T., Tseng, H.Y., Straub, J., Khan, N., Xiao, L., Yang, M.H., et al.: Dgs-lrm: Real-time deformable 3d gaussian reconstruction from monocular videos. arXiv preprint arXiv:2506.09997 (2025)
  47. [47] Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)
  48. [48] Liu, H., Zaharia, M., Abbeel, P.: Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889 (2023)
  49. [49] Ma, Z., Chen, X., Yu, S., Bi, S., Zhang, K., Ziwen, C., Xu, S., Yang, J., Xu, Z., Sunkavalli, K., et al.: 4d-lrm: Large space-time reconstruction model from and to any view at any time. arXiv preprint arXiv:2506.18890 (2025)
  50. [50] Meka, A., Maximov, M., Zollhoefer, M., Chatterjee, A., Seidel, H.P., Richardt, C., Theobalt, C.: Lime: Live intrinsic material estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6315–6324 (2018)
  51. [51] Munkberg, J., Hasselgren, J., Shen, T., Gao, J., Chen, W., Evans, A., Müller, T., Fidler, S.: Extracting triangular 3d models, materials, and lighting from images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8280–8290 (2022)
  52. [52] Qiu, Z., Wang, Z., Zheng, B., Huang, Z., Wen, K., Yang, S., Men, R., Yu, L., Huang, F., Huang, S., et al.: Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708 (2025)
  53. [53] Shi, Y., Wu, Y., Wu, C., Liu, X., Zhao, C., Feng, H., Liu, J., Zhang, L., Zhang, J., Zhou, B., et al.: Gir: 3d gaussian inverse rendering for relightable scene factorization. arXiv preprint arXiv:2312.05133 (2023)
  54. [54] Siddiqui, Y., Monnier, T., Kokkinos, F., Kariya, M., Kleiman, Y., Garreau, E., Gafni, O., Neverova, N., Vedaldi, A., Shapovalov, R., et al.: Meta 3d assetgen: Text-to-mesh generation with high-quality geometry, texture, and pbr materials. arXiv preprint arXiv:2407.02445 (2024)
  55. [55] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)
  56. [56] Sun, C., Cai, G., Li, Z., Yan, K., Zhang, C., Marshall, C., Huang, J.B., Zhao, S., Dong, Z.: Neural-pbir reconstruction of shape, material, and illumination. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18046–18056 (2023)
  57. [57] Tillet, P.: Introducing Triton: Open-source GPU programming for neural networks. OpenAI Blog, https://openai.com/index/triton/ (July 2021)
  58. [58] Ummenhofer, B., Agrawal, S., Sepulveda, R., Lao, Y., Zhang, K., Cheng, T., Richter, S., Wang, S., Ros, G.: Objects with lighting: A real-world dataset for evaluating reconstruction and rendering for object relighting. In: 2024 International Conference on 3D Vision (3DV), pp. 137–147. IEEE (2024)
  59. [59] Vahdat, A., Williams, F., Gojcic, Z., Litany, O., Fidler, S., Kreis, K., et al.: Lion: Latent point diffusion models for 3d shape generation. Advances in Neural Information Processing Systems 35, 10021–10039 (2022)

  60. [60] Vaswani, A.: Attention is all you need. Advances in Neural Information Processing Systems (2017)
  61. [61] Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 5294–5306 (2025)
  62. [62] Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20697–20709 (2024)
  63. [63] Wei, X., Zhang, K., Bi, S., Tan, H., Luan, F., Deschaintre, V., Sunkavalli, K., Su, H., Xu, Z.: Meshlrm: Large reconstruction model for high-quality mesh. arXiv preprint arXiv:2404.12385 (2024)
  64. [64] Wu, S., Lin, Y., Zhang, F., Zeng, Y., Xu, J., Torr, P., Cao, X., Yao, Y.: Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer. Advances in Neural Information Processing Systems 37, 121859–121881 (2024)
  65. [65] Wu, S., Lin, Y., Zhang, F., Zeng, Y., Yang, Y., Bao, Y., Qian, J., Zhu, S., Cao, X., Torr, P., et al.: Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention. arXiv preprint arXiv:2505.17412 (2025)
  66. [66] Xiang, J., Chen, X., Xu, S., Wang, R., Lv, Z., Deng, Y., Zhu, H., Dong, Y., Zhao, H., Yuan, N.J., et al.: Native and compact structured latents for 3d generation. arXiv preprint arXiv:2512.14692 (2025)
  67. [67] Xiang, J., Lv, Z., Xu, S., Deng, Y., Wang, R., Zhang, B., Chen, D., Tong, X., Yang, J.: Structured 3d latents for scalable and versatile 3d generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21469–21480 (2025)
  68. [68] Xu, Y., Shi, Z., Yifan, W., Chen, H., Yang, C., Peng, S., Shen, Y., Wetzstein, G.: Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation. In: European Conference on Computer Vision, pp. 1–20. Springer (2024)
  69. [69] Xu, Z., Li, Z., Dong, Z., Zhou, X., Newcombe, R., Lv, Z.: 4dgt: Learning a 4d gaussian transformer using real-world monocular videos. arXiv preprint arXiv:2506.08015 (2025)
  70. [70] Yang, H., Dong, Y., Jiang, H., Xu, D., Pavlakos, G., Huang, Q.: Atlas gaussians diffusion for 3d generation. arXiv preprint arXiv:2408.13055 (2024)
  71. [71] Yang, Z., Chen, Y., Gao, X., Yuan, Y., Wu, Y., Zhou, X., Jin, X.: Sire-ir: Inverse rendering for brdf reconstruction with shadow and illumination removal in high-illuminance scenes. arXiv preprint arXiv:2310.13030 (2023)
  72. [72] Yariv, L., Gu, J., Kasten, Y., Lipman, Y.: Volume rendering of neural implicit surfaces. Advances in Neural Information Processing Systems 34, 4805–4815 (2021)
  73. [73] Yuan, J., Gao, H., Dai, D., Luo, J., Zhao, L., Zhang, Z., Xie, Z., Wei, Y., Wang, L., Xiao, Z., et al.: Native sparse attention: Hardware-aligned and natively trainable sparse attention. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 23078–23097 (2025)
  74. [74] Zaheer, M., Guruganesh, G., Dubey, K.A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., et al.: Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems 33, 17283–17297 (2020)
  75. [75] Zarzar, J., Monnier, T., Shapovalov, R., Vedaldi, A., Novotny, D.: Twinner: Shining light on digital twins in a few snaps. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 5859–5869 (2025)
  76. [76] Zeng, Z., Deschaintre, V., Georgiev, I., Hold-Geoffroy, Y., Hu, Y., Luan, F., Yan, L.Q., Hašan, M.: Rgb↔x: Image decomposition and synthesis using material- and lighting-aware diffusion models. In: ACM SIGGRAPH 2024 Conference Papers (SIGGRAPH '24), Association for Computing Machinery (2024)
  77. [77] Zhang, B., Tang, J., Niessner, M., Wonka, P.: 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models. ACM Transactions on Graphics (TOG) 42(4), 1–16 (2023)
  78. [78] Zhang, C., Moing, G.L., Koppula, S., Rocco, I., Momeni, L., Xie, J., Sun, S., Sukthankar, R., Barral, J.K., Hadsell, R., et al.: Efficiently reconstructing dynamic scenes one d4rt at a time. arXiv preprint arXiv:2512.08924 (2025)
  79. [79] Zhang, K., Bi, S., Tan, H., Xiangli, Y., Zhao, N., Sunkavalli, K., Xu, Z.: Gs-lrm: Large reconstruction model for 3d gaussian splatting. In: European Conference on Computer Vision, pp. 1–19. Springer (2025)
  80. [80] Zhang, K., Luan, F., Li, Z., Snavely, N.: Iron: Inverse rendering by optimizing neural sdfs and materials from photometric images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5565–5574 (2022)

Showing first 80 references.