Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction

Chen Ziwen; Hao Tan; Li Fuxin; Peng Wang; Zexiang Xu

arxiv: 2512.10267 · v2 · submitted 2025-12-11 · 💻 cs.CV

Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction

Chen Ziwen , Hao Tan , Peng Wang , Zexiang Xu , Li Fuxin This is my paper

Pith reviewed 2026-05-16 23:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords feed-forward reconstructionGaussian splattingreal-time renderingnovel view synthesisscene reconstructiondepth predictionimplicit representations

0 comments

The pith

A semi-explicit scene representation with a lightweight decoder preserves fine details in feed-forward wide-coverage reconstruction while delivering real-time performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Long-LRM++ to resolve the conflict between high-fidelity rendering and speed in generalizable Gaussian splatting methods that handle dozens of input views. Earlier explicit approaches that directly output millions of Gaussian parameters suffer from blurring in fine structures such as text, while implicit methods achieve better detail by embedding scene information in model weights but require expensive per-frame decompression. Long-LRM++ instead uses a semi-explicit representation decoded by a lightweight network, retaining the quality of full implicit models such as LaCT yet reaching 14 FPS on an A100 GPU. The design also extends to 64 input views at 950 by 540 resolution and improves novel-view depth accuracy over direct Gaussian depth rendering.

Core claim

Long-LRM++ adopts a semi-explicit scene representation combined with a lightweight decoder. This design matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU. The approach also scales to 64 input views at the 950 by 540 resolution and delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians.

What carries the argument

Semi-explicit scene representation decoded by a lightweight network that avoids full transformer-based decompression for each rendered frame.

If this is right

Direct prediction of millions of Gaussian parameters can be replaced without sacrificing detail fidelity.
Real-time 360-degree scene reconstruction becomes practical from 32 or 64 input views in a single forward pass.
Novel-view depth maps can be obtained more accurately than by rendering from explicit Gaussians.
The same semi-explicit plus lightweight-decoder pattern can support higher input resolutions or longer view sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may reduce sensitivity to input view ordering or camera calibration errors compared with purely explicit methods.
Real-time capability could enable online reconstruction pipelines in robotics or mobile AR where latency matters.
The design suggests that many implicit decompression steps can be pre-computed into a compact explicit layer without quality loss.

Load-bearing premise

A lightweight decoder on top of a semi-explicit representation can preserve fine details as effectively as full transformer-based implicit decompression without introducing new blurring artifacts.

What would settle it

A side-by-side rendering test on DL3DV scenes containing text or fine structures where Long-LRM++ shows measurable blurring or loss of sharpness that LaCT does not.

Figures

Figures reproduced from arXiv: 2512.10267 by Chen Ziwen, Hao Tan, Li Fuxin, Peng Wang, Zexiang Xu.

**Figure 2.** Figure 2: Overview of the Long-LRM++ architecture. Long-LRM++ takes up to 64 input images at 950 [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative comparison of novel-view rendering on DL3DV (32-input, 960 [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of novel-view color and depth rendering on ScanNetv2 (128-input, 448 [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

read the original abstract

Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at $950\times540$ resolution, achieving 360{\deg} scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians, and decoding RGB frames using the full transformer or TTT backbone. However, this computationally intensive decompression process for every rendered frame makes real-time rendering infeasible. These observations raise key questions: Is the deep, sequential "decompression" process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU, overcoming the speed limitations of prior implicit methods. Our design also scales to 64 input views at the $950\times540$ resolution, demonstrating strong generalization to increased input lengths. Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component in the proposed framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Long-LRM++ pairs a semi-explicit representation with a lightweight decoder to hit real-time speeds while matching LaCT quality on DL3DV and scaling to 64 views, though the fine-detail preservation claim rests on aggregate numbers that could use more targeted checks.

read the letter

Long-LRM++ takes the feed-forward Gaussian splatting approach from Long-LRM and adds a semi-explicit scene representation plus a lightweight decoder. The result is real-time 14 FPS rendering on an A100 while matching LaCT quality on DL3DV and scaling cleanly to 64 input views at 950x540. It also reports better novel-view depth on ScanNetv2 than direct Gaussian depth rendering. The design keeps the single-pass property but avoids the per-frame heavy transformer cost that makes prior implicit methods too slow for real-time use. Ablations are included to test each component, which helps show where the gains come from. The citation pattern stays within the standard Long-LRM and LaCT line without circularity or invented entities. These pieces make the work practically relevant for anyone needing both coverage and speed in novel-view synthesis. The soft spot sits in the fine-detail claim. The abstract correctly notes that direct Gaussian prediction blurs text and edges, then positions the lightweight decoder as the solution that recovers fidelity without full implicit decompression. Yet the reported numbers are overall PSNR/SSIM/LPIPS, with no breakdown on high-frequency regions or visual comparisons focused on text and sharp structures. If the decoder trades capacity for speed, the equivalence to LaCT could weaken exactly where error sensitivity was the original problem. That part of the argument would be stronger with region-specific metrics or failure-case analysis. This paper is for researchers building real-time systems for VR, AR, or robotics who already follow the generalizable GS literature. It is not a foundational shift but a targeted engineering step that addresses a clear speed-quality tension. I would bring it to a reading group to see the implementation details and the scaling curves. It deserves peer review because the claims are testable on public datasets, the baselines are external, and the practical numbers are worth referee scrutiny even if the core hybrid idea is incremental.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Long-LRM++, which extends prior feed-forward Gaussian splatting work to handle up to 64 input views at 950x540 resolution for 360-degree scene reconstruction. It replaces direct prediction of millions of Gaussian parameters with a semi-explicit scene representation decoded by a lightweight network, aiming to reduce blurring in fine structures while retaining the speed advantages of explicit methods. The central claims are that Long-LRM++ matches LaCT rendering quality on DL3DV, runs at real-time 14 FPS on A100, scales to longer input sequences, and yields superior novel-view depth estimates on ScanNetv2, supported by ablations.

Significance. If the quality equivalence is substantiated, the work offers a practical advance by reconciling the high fidelity of implicit transformer-based decompression with real-time explicit rendering, which is valuable for applications requiring both wide-coverage reconstruction and interactive speeds. The scaling behavior to 64 views and the depth prediction improvement are concrete strengths that could influence follow-on research in generalizable novel-view synthesis.

major comments (2)

[§4.1 and Table 2] §4.1 and Table 2: the headline claim that Long-LRM++ matches LaCT quality on DL3DV rests on aggregate PSNR/SSIM/LPIPS scores, yet no per-scene high-frequency breakdown (e.g., text or edge regions) or frequency-aware metrics are reported; without these, it remains unclear whether the lightweight decoder fully mitigates the blurring the introduction attributes to explicit Gaussian prediction.
[§3.3 Decoder Architecture] §3.3 Decoder Architecture: the semi-explicit representation plus lightweight decoder is presented as capacity-efficient, but the manuscript provides no parameter count comparison to LaCT's full transformer backbone or ablation isolating decoder depth versus fidelity; this leaves open the possibility that observed quality parity is dataset-specific rather than generally preserved.

minor comments (2)

[Figure 3] Figure 3: the qualitative comparisons would be more convincing with zoomed insets and corresponding error maps focused on fine structures such as signage or foliage.
[§3.2] The notation for the semi-explicit feature volume is introduced without an explicit equation linking it to the subsequent decoder input; adding a short equation in §3.2 would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments and the opportunity to clarify our contributions. We address the major comments point-by-point below, providing additional context from our experiments and committing to revisions where appropriate to strengthen the manuscript.

read point-by-point responses

Referee: [§4.1 and Table 2] §4.1 and Table 2: the headline claim that Long-LRM++ matches LaCT quality on DL3DV rests on aggregate PSNR/SSIM/LPIPS scores, yet no per-scene high-frequency breakdown (e.g., text or edge regions) or frequency-aware metrics are reported; without these, it remains unclear whether the lightweight decoder fully mitigates the blurring the introduction attributes to explicit Gaussian prediction.

Authors: We acknowledge that our primary quantitative comparison in §4.1 and Table 2 relies on aggregate metrics across the DL3DV dataset. To support the claim of matching LaCT quality while preserving fine details, the manuscript includes qualitative results in Figure 4 and the appendix, demonstrating reduced blurring in high-frequency elements such as text and edges. LPIPS, being a perceptual metric, is particularly sensitive to such details. We agree that a more granular analysis would be beneficial and will include a per-scene breakdown for high-frequency regions (e.g., selecting scenes with prominent text) in the revised version, along with any additional frequency-domain metrics if feasible. This will further substantiate that the semi-explicit representation with lightweight decoder effectively mitigates the blurring issue. revision: yes
Referee: [§3.3 Decoder Architecture] §3.3 Decoder Architecture: the semi-explicit representation plus lightweight decoder is presented as capacity-efficient, but the manuscript provides no parameter count comparison to LaCT's full transformer backbone or ablation isolating decoder depth versus fidelity; this leaves open the possibility that observed quality parity is dataset-specific rather than generally preserved.

Authors: We appreciate this observation regarding the decoder architecture in §3.3. While the manuscript emphasizes the lightweight nature of the decoder for real-time performance, we did not include explicit parameter counts or depth ablations. In the revision, we will add a comparison of parameter counts between our lightweight decoder and LaCT's full transformer backbone. Furthermore, we will conduct and report an ablation study varying the decoder depth to show its impact on fidelity, confirming that the chosen lightweight configuration achieves quality parity without requiring the full capacity of implicit methods. This will help demonstrate the generalizability of our approach beyond the specific datasets used. revision: yes

Circularity Check

0 steps flagged

Minor self-citation to Long-LRM but central quality/speed claims rest on external comparisons

full rationale

The paper introduces Long-LRM++ as a semi-explicit representation plus lightweight decoder and reports empirical matches to LaCT rendering quality on DL3DV plus 14 FPS real-time performance. These results are obtained via direct benchmarking against external prior methods (LaCT, LVSM) on standard datasets rather than any reduction of the reported gains to quantities defined by the authors' own fitted parameters or self-citations. The only self-reference is the contextual mention of the prior Long-LRM work, which is not load-bearing for the new claims about detail preservation or speed. No equations, uniqueness theorems, or ansatzes are shown to collapse the core assertions back to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond standard neural network components; all details are deferred to the full manuscript.

pith-pipeline@v0.9.0 · 5604 in / 991 out tokens · 22573 ms · 2026-05-16T23:52:05.393100+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Fast Spatial Memory with Elastic Test-Time Training
cs.CV 2026-04 unverdicted novelty 6.0

Elastic Test-Time Training stabilizes test-time updates via an elastic prior and moving-average anchor, enabling Fast Spatial Memory for scalable long-sequence 4D reconstruction with reduced memory use and fewer shortcuts.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Neural point-based graphics

Kara-Ali Aliev, Artem Sevastopolsky, Maria Kolos, Dmitry Ulyanov, and Victor Lempitsky. Neural point-based graphics. InEuropean conference on computer vision, pages 696–712. Springer, 2020. 2

work page 2020
[2]

pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19457–19467, 2024. 1, 2

work page 2024
[3]

Mvsnerf: Fast general- izable radiance field reconstruction from multi-view stereo

Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast general- izable radiance field reconstruction from multi-view stereo. InProceedings of the IEEE/CVF international conference on computer vision, pages 14124–14133, 2021. 2

work page 2021
[4]

Tensorf: Tensorial radiance fields

Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. InEuropean Conference on Computer Vision (ECCV), 2022. 2

work page 2022
[5]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images, 2024

Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images.arXiv preprint arXiv:2403.14627, 2024. 1, 2

work page arXiv 2024
[6]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 2, 5, 6

work page 2017
[7]

Plenoxels: Radiance fields without neural networks

Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5501–5510, 2022. 2

work page 2022
[8]

Query-key normal- ization for transformers

Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, and Yuxuan Chen. Query-key normalization for transformers. arXiv preprint arXiv:2010.04245, 2020. 5

work page arXiv 2010
[9]

Lrm: Large reconstruction model for single image to 3d

Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. InThe Twelfth International Conference on Learning Representations, 2024. 2

work page 2024
[10]

2d gaussian splatting for geometrically ac- curate radiance fields

Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically ac- curate radiance fields. InACM SIGGRAPH 2024 conference papers, pages 1–11, 2024. 8

work page 2024
[11]

Longsplat: On- line generalizable 3d gaussian splatting from long sequence images.arXiv preprint arXiv:2507.16144, 2025

Guichen Huang, Ruoyu Wang, Xiangjun Gao, Che Sun, Yuwei Wu, Shenghua Gao, and Yunde Jia. Longsplat: On- line generalizable 3d gaussian splatting from long sequence images.arXiv preprint arXiv:2507.16144, 2025. 2

work page arXiv 2025
[12]

Lvsm: A large view synthesis model with minimal 3d inductive bias.arXiv preprint arXiv:2410.17242, 2024

Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias.arXiv preprint arXiv:2410.17242, 2024. 2, 3, 5

work page arXiv 2024
[13]

Geonerf: Generalizing nerf with geometry priors

Mohammad Mahdi Johari, Yann Lepoittevin, and Franc ¸ois Fleuret. Geonerf: Generalizing nerf with geometry priors. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18365–18375, 2022. 2

work page 2022
[14]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023. 1, 5

work page 2023
[15]

Spacetime gaus- sian feature splatting for real-time dynamic view synthesis

Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. Spacetime gaus- sian feature splatting for real-time dynamic view synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8508–8520, 2024. 3

work page 2024
[16]

Efficient neural radiance fields for interactive free-viewpoint video

Haotong Lin, Sida Peng, Zhen Xu, Yunzhi Yan, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Efficient neural radiance fields for interactive free-viewpoint video. InSIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022. 2

work page 2022
[17]

Dl3dv-10k: A large-scale scene dataset for deep learning- based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning- based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160– 22169, 2024. 2, 5

work page 2024
[18]

Neural sparse voxel fields.Advances in Neural Information Processing Systems, 33:15651–15663,

Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields.Advances in Neural Information Processing Systems, 33:15651–15663,

work page
[19]

Fast generalizable gaussian splatting reconstruction from multi-view stereo.arXiv preprint arXiv:2405.12218, 2024

Tianqi Liu, Guangcong Wang, Shoukang Hu, Liao Shen, Xinyi Ye, Yuhang Zang, Zhiguo Cao, Wei Li, and Ziwei Liu. Fast generalizable gaussian splatting reconstruction from multi-view stereo.arXiv preprint arXiv:2405.12218, 2024. 1, 2

work page arXiv 2024
[20]

Neural rays for occlusion-aware image-based rendering

Yuan Liu, Sida Peng, Lingjie Liu, Qianqian Wang, Peng Wang, Christian Theobalt, Xiaowei Zhou, and Wenping Wang. Neural rays for occlusion-aware image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 7824–7833, 2022. 2

work page 2022
[21]

Neural volumes: Learning dynamic renderable volumes from images,

Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. arXiv preprint arXiv:1906.07751, 2019. 2

work page arXiv 1906
[22]

Scaffold-gs: Structured 3d gaussians for view-adaptive rendering

Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20654–20664, 2024. 2

work page 2024
[23]

Local light field fusion: Practical view synthe- sis with prescriptive sampling guidelines.ACM Transactions on Graphics (ToG), 38(4):1–14, 2019

Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthe- sis with prescriptive sampling guidelines.ACM Transactions on Graphics (ToG), 38(4):1–14, 2019. 2

work page 2019
[24]

Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 1, 2

work page 2021
[25]

Instant neural graphics primitives with a multires- olution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022

Thomas M¨uller, Alex Evans, Christoph Schied, and Alexan- der Keller. Instant neural graphics primitives with a multires- olution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022. 2

work page 2022
[26]

Feature splatting: Language-driven physics-based scene synthesis and editing,

Ri-Zhao Qiu, Ge Yang, Weijia Zeng, and Xiaolong Wang. Feature splatting: Language-driven physics-based scene syn- thesis and editing.arXiv preprint arXiv:2404.01223, 2024. 3

work page arXiv 2024
[27]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 3

work page 2021
[28]

Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians.arXiv preprint arXiv:2403.17898, 2024

Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians.arXiv preprint arXiv:2403.17898, 2024. 2

work page arXiv 2024
[29]

Stable view synthesis

Gernot Riegler and Vladlen Koltun. Stable view synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12216–12225, 2021. 2

work page 2021
[30]

Simplere- con: 3d reconstruction without 3d convolutions

Mohamed Sayed, John Gibson, Jamie Watson, Victor Prisacariu, Michael Firman, and Cl´ement Godard. Simplere- con: 3d reconstruction without 3d convolutions. InEuropean Conference on Computer Vision, pages 1–19. Springer, 2022. 4

work page 2022
[31]

Pixelwise view selection for unstructured multi-view stereo

Johannes L Sch¨onberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. InComputer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, Octo- ber 11-14, 2016, Proceedings, Part III 14, pages 501–518. Springer, 2016. 5

work page 2016
[32]

Deepvox- els: Learning persistent 3d feature embeddings

Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. Deepvox- els: Learning persistent 3d feature embeddings. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2437–2446, 2019. 2

work page 2019
[33]

Scene representation networks: Continuous 3d-structure- aware neural scene representations.Advances in neural infor- mation processing systems, 32, 2019

Vincent Sitzmann, Michael Zollh¨ofer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure- aware neural scene representations.Advances in neural infor- mation processing systems, 32, 2019. 2

work page 2019
[34]

Generalizable patch-based neural ren- dering

Mohammed Suhail, Carlos Esteves, Leonid Sigal, and Ameesh Makadia. Generalizable patch-based neural ren- dering. InEuropean Conference on Computer Vision, pages 156–174. Springer, 2022. 2

work page 2022
[35]

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states.arXiv preprint arXiv:2407.04620, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

De- ferred neural rendering: Image synthesis using neural textures

Justus Thies, Michael Zollh¨ofer, and Matthias Nießner. De- ferred neural rendering: Image synthesis using neural textures. Acm Transactions on Graphics (TOG), 38(4):1–12, 2019. 2

work page 2019
[37]

Sags: Structure-aware 3d gaussian splatting

Evangelos Ververas, Rolandos Alexandros Potamias, Jifei Song, Jiankang Deng, and Stefanos Zafeiriou. Sags: Structure-aware 3d gaussian splatting. InEuropean Con- ference on Computer Vision, pages 221–238. Springer, 2024. 2

work page 2024
[38]

Pf-lrm: Pose-free large reconstruction model for joint pose and shape prediction

Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. Pf-lrm: Pose-free large reconstruction model for joint pose and shape prediction. InThe Twelfth International Conference on Learning Representations, 2023. 2

work page 2023
[39]

Ibrnet: Learning multi-view image-based rendering

Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srini- vasan, Howard Zhou, Jonathan T Barron, Ricardo Martin- Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2021. 2

work page 2021
[40]

Depth- splat: Connecting gaussian splatting and depth

Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depth- splat: Connecting gaussian splatting and depth. InProceed- ings of the Computer Vision and Pattern Recognition Confer- ence, pages 16453–16463, 2025. 1, 2

work page 2025
[41]

Point-nerf: Point- based neural radiance fields

Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point- based neural radiance fields. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5438–5448, 2022. 2

work page 2022
[42]

Multi-space neural radiance fields

Ze-Xin Yin, Jiaxiong Qiu, Ming-Ming Cheng, and Bo Ren. Multi-space neural radiance fields. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12407–12416, 2023. 4

work page 2023
[43]

pixelnerf: Neural radiance fields from one or few images

Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4578–4587, 2021. 2

work page 2021
[44]

Gs-lrm: Large reconstruction model for 3d gaussian splatting.ArXiv, abs/2404.19702, 2024

Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large recon- struction model for 3d gaussian splatting.arXiv preprint arXiv:2404.19702, 2024. 1, 2, 4

work page arXiv 2024
[45]

Test-Time Training Done Right

Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right.arXiv preprint arXiv:2505.23884, 2025. 2, 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Nerfusion: Fusing radiance fields for large- scale scene reconstruction

Xiaoshuai Zhang, Sai Bi, Kalyan Sunkavalli, Hao Su, and Zexiang Xu. Nerfusion: Fusing radiance fields for large- scale scene reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5449–5458, 2022. 2

work page 2022
[47]

Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields

Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Ze- hao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024. 3

work page 2024
[48]

Autofocusformer: Image segmentation off the grid

Chen Ziwen, Kaushik Patnaik, Shuangfei Zhai, Alvin Wan, Zhile Ren, Alexander G Schwing, Alex Colburn, and Li Fuxin. Autofocusformer: Image segmentation off the grid. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18227–18236, 2023. 4

work page 2023
[49]

Ziwen, H

Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. arXiv preprint 2410.12781, 2024. 1, 2, 3, 4, 5 Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction Supplementary Material

work page arXiv 2024
[50]

More implementation details Due to its semi-explicit formulation, Long-LRM++ exhibits a stronger tendency to overfit to input frames when training on mixed sets of input and unseen target frames. This ef- fect becomes more pronounced on datasets such as DL3DV , where neighboring frames have relatively large pose differ- ences—that is, the effective frame ...

work page

[1] [1]

Neural point-based graphics

Kara-Ali Aliev, Artem Sevastopolsky, Maria Kolos, Dmitry Ulyanov, and Victor Lempitsky. Neural point-based graphics. InEuropean conference on computer vision, pages 696–712. Springer, 2020. 2

work page 2020

[2] [2]

pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19457–19467, 2024. 1, 2

work page 2024

[3] [3]

Mvsnerf: Fast general- izable radiance field reconstruction from multi-view stereo

Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast general- izable radiance field reconstruction from multi-view stereo. InProceedings of the IEEE/CVF international conference on computer vision, pages 14124–14133, 2021. 2

work page 2021

[4] [4]

Tensorf: Tensorial radiance fields

Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. InEuropean Conference on Computer Vision (ECCV), 2022. 2

work page 2022

[5] [5]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images, 2024

Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images.arXiv preprint arXiv:2403.14627, 2024. 1, 2

work page arXiv 2024

[6] [6]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 2, 5, 6

work page 2017

[7] [7]

Plenoxels: Radiance fields without neural networks

Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5501–5510, 2022. 2

work page 2022

[8] [8]

Query-key normal- ization for transformers

Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, and Yuxuan Chen. Query-key normalization for transformers. arXiv preprint arXiv:2010.04245, 2020. 5

work page arXiv 2010

[9] [9]

Lrm: Large reconstruction model for single image to 3d

Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. InThe Twelfth International Conference on Learning Representations, 2024. 2

work page 2024

[10] [10]

2d gaussian splatting for geometrically ac- curate radiance fields

Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically ac- curate radiance fields. InACM SIGGRAPH 2024 conference papers, pages 1–11, 2024. 8

work page 2024

[11] [11]

Longsplat: On- line generalizable 3d gaussian splatting from long sequence images.arXiv preprint arXiv:2507.16144, 2025

Guichen Huang, Ruoyu Wang, Xiangjun Gao, Che Sun, Yuwei Wu, Shenghua Gao, and Yunde Jia. Longsplat: On- line generalizable 3d gaussian splatting from long sequence images.arXiv preprint arXiv:2507.16144, 2025. 2

work page arXiv 2025

[12] [12]

Lvsm: A large view synthesis model with minimal 3d inductive bias.arXiv preprint arXiv:2410.17242, 2024

Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias.arXiv preprint arXiv:2410.17242, 2024. 2, 3, 5

work page arXiv 2024

[13] [13]

Geonerf: Generalizing nerf with geometry priors

Mohammad Mahdi Johari, Yann Lepoittevin, and Franc ¸ois Fleuret. Geonerf: Generalizing nerf with geometry priors. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18365–18375, 2022. 2

work page 2022

[14] [14]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023. 1, 5

work page 2023

[15] [15]

Spacetime gaus- sian feature splatting for real-time dynamic view synthesis

Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. Spacetime gaus- sian feature splatting for real-time dynamic view synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8508–8520, 2024. 3

work page 2024

[16] [16]

Efficient neural radiance fields for interactive free-viewpoint video

Haotong Lin, Sida Peng, Zhen Xu, Yunzhi Yan, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Efficient neural radiance fields for interactive free-viewpoint video. InSIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022. 2

work page 2022

[17] [17]

Dl3dv-10k: A large-scale scene dataset for deep learning- based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning- based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160– 22169, 2024. 2, 5

work page 2024

[18] [18]

Neural sparse voxel fields.Advances in Neural Information Processing Systems, 33:15651–15663,

Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields.Advances in Neural Information Processing Systems, 33:15651–15663,

work page

[19] [19]

Fast generalizable gaussian splatting reconstruction from multi-view stereo.arXiv preprint arXiv:2405.12218, 2024

Tianqi Liu, Guangcong Wang, Shoukang Hu, Liao Shen, Xinyi Ye, Yuhang Zang, Zhiguo Cao, Wei Li, and Ziwei Liu. Fast generalizable gaussian splatting reconstruction from multi-view stereo.arXiv preprint arXiv:2405.12218, 2024. 1, 2

work page arXiv 2024

[20] [20]

Neural rays for occlusion-aware image-based rendering

Yuan Liu, Sida Peng, Lingjie Liu, Qianqian Wang, Peng Wang, Christian Theobalt, Xiaowei Zhou, and Wenping Wang. Neural rays for occlusion-aware image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 7824–7833, 2022. 2

work page 2022

[21] [21]

Neural volumes: Learning dynamic renderable volumes from images,

Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. arXiv preprint arXiv:1906.07751, 2019. 2

work page arXiv 1906

[22] [22]

Scaffold-gs: Structured 3d gaussians for view-adaptive rendering

Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20654–20664, 2024. 2

work page 2024

[23] [23]

Local light field fusion: Practical view synthe- sis with prescriptive sampling guidelines.ACM Transactions on Graphics (ToG), 38(4):1–14, 2019

Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthe- sis with prescriptive sampling guidelines.ACM Transactions on Graphics (ToG), 38(4):1–14, 2019. 2

work page 2019

[24] [24]

Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 1, 2

work page 2021

[25] [25]

Instant neural graphics primitives with a multires- olution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022

Thomas M¨uller, Alex Evans, Christoph Schied, and Alexan- der Keller. Instant neural graphics primitives with a multires- olution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022. 2

work page 2022

[26] [26]

Feature splatting: Language-driven physics-based scene synthesis and editing,

Ri-Zhao Qiu, Ge Yang, Weijia Zeng, and Xiaolong Wang. Feature splatting: Language-driven physics-based scene syn- thesis and editing.arXiv preprint arXiv:2404.01223, 2024. 3

work page arXiv 2024

[27] [27]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 3

work page 2021

[28] [28]

Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians.arXiv preprint arXiv:2403.17898, 2024

Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians.arXiv preprint arXiv:2403.17898, 2024. 2

work page arXiv 2024

[29] [29]

Stable view synthesis

Gernot Riegler and Vladlen Koltun. Stable view synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12216–12225, 2021. 2

work page 2021

[30] [30]

Simplere- con: 3d reconstruction without 3d convolutions

Mohamed Sayed, John Gibson, Jamie Watson, Victor Prisacariu, Michael Firman, and Cl´ement Godard. Simplere- con: 3d reconstruction without 3d convolutions. InEuropean Conference on Computer Vision, pages 1–19. Springer, 2022. 4

work page 2022

[31] [31]

Pixelwise view selection for unstructured multi-view stereo

Johannes L Sch¨onberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. InComputer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, Octo- ber 11-14, 2016, Proceedings, Part III 14, pages 501–518. Springer, 2016. 5

work page 2016

[32] [32]

Deepvox- els: Learning persistent 3d feature embeddings

Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. Deepvox- els: Learning persistent 3d feature embeddings. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2437–2446, 2019. 2

work page 2019

[33] [33]

Scene representation networks: Continuous 3d-structure- aware neural scene representations.Advances in neural infor- mation processing systems, 32, 2019

Vincent Sitzmann, Michael Zollh¨ofer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure- aware neural scene representations.Advances in neural infor- mation processing systems, 32, 2019. 2

work page 2019

[34] [34]

Generalizable patch-based neural ren- dering

Mohammed Suhail, Carlos Esteves, Leonid Sigal, and Ameesh Makadia. Generalizable patch-based neural ren- dering. InEuropean Conference on Computer Vision, pages 156–174. Springer, 2022. 2

work page 2022

[35] [35]

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states.arXiv preprint arXiv:2407.04620, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

De- ferred neural rendering: Image synthesis using neural textures

Justus Thies, Michael Zollh¨ofer, and Matthias Nießner. De- ferred neural rendering: Image synthesis using neural textures. Acm Transactions on Graphics (TOG), 38(4):1–12, 2019. 2

work page 2019

[37] [37]

Sags: Structure-aware 3d gaussian splatting

Evangelos Ververas, Rolandos Alexandros Potamias, Jifei Song, Jiankang Deng, and Stefanos Zafeiriou. Sags: Structure-aware 3d gaussian splatting. InEuropean Con- ference on Computer Vision, pages 221–238. Springer, 2024. 2

work page 2024

[38] [38]

Pf-lrm: Pose-free large reconstruction model for joint pose and shape prediction

Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. Pf-lrm: Pose-free large reconstruction model for joint pose and shape prediction. InThe Twelfth International Conference on Learning Representations, 2023. 2

work page 2023

[39] [39]

Ibrnet: Learning multi-view image-based rendering

Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srini- vasan, Howard Zhou, Jonathan T Barron, Ricardo Martin- Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2021. 2

work page 2021

[40] [40]

Depth- splat: Connecting gaussian splatting and depth

Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depth- splat: Connecting gaussian splatting and depth. InProceed- ings of the Computer Vision and Pattern Recognition Confer- ence, pages 16453–16463, 2025. 1, 2

work page 2025

[41] [41]

Point-nerf: Point- based neural radiance fields

Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point- based neural radiance fields. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5438–5448, 2022. 2

work page 2022

[42] [42]

Multi-space neural radiance fields

Ze-Xin Yin, Jiaxiong Qiu, Ming-Ming Cheng, and Bo Ren. Multi-space neural radiance fields. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12407–12416, 2023. 4

work page 2023

[43] [43]

pixelnerf: Neural radiance fields from one or few images

Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4578–4587, 2021. 2

work page 2021

[44] [44]

Gs-lrm: Large reconstruction model for 3d gaussian splatting.ArXiv, abs/2404.19702, 2024

Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large recon- struction model for 3d gaussian splatting.arXiv preprint arXiv:2404.19702, 2024. 1, 2, 4

work page arXiv 2024

[45] [45]

Test-Time Training Done Right

Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right.arXiv preprint arXiv:2505.23884, 2025. 2, 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

Nerfusion: Fusing radiance fields for large- scale scene reconstruction

Xiaoshuai Zhang, Sai Bi, Kalyan Sunkavalli, Hao Su, and Zexiang Xu. Nerfusion: Fusing radiance fields for large- scale scene reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5449–5458, 2022. 2

work page 2022

[47] [47]

Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields

Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Ze- hao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024. 3

work page 2024

[48] [48]

Autofocusformer: Image segmentation off the grid

Chen Ziwen, Kaushik Patnaik, Shuangfei Zhai, Alvin Wan, Zhile Ren, Alexander G Schwing, Alex Colburn, and Li Fuxin. Autofocusformer: Image segmentation off the grid. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18227–18236, 2023. 4

work page 2023

[49] [49]

Ziwen, H

Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. arXiv preprint 2410.12781, 2024. 1, 2, 3, 4, 5 Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction Supplementary Material

work page arXiv 2024

[50] [50]

More implementation details Due to its semi-explicit formulation, Long-LRM++ exhibits a stronger tendency to overfit to input frames when training on mixed sets of input and unseen target frames. This ef- fect becomes more pronounced on datasets such as DL3DV , where neighboring frames have relatively large pose differ- ences—that is, the effective frame ...

work page