
arxiv: 2512.10267 · v2 · submitted 2025-12-11 · 💻 cs.CV

Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction

Pith reviewed 2026-05-16 23:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords: feed-forward reconstruction, Gaussian splatting, real-time rendering, novel view synthesis, scene reconstruction, depth prediction, implicit representations

The pith

A semi-explicit scene representation with a lightweight decoder preserves fine details in feed-forward wide-coverage reconstruction while delivering real-time performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Long-LRM++ to resolve the conflict between high-fidelity rendering and speed in generalizable Gaussian splatting methods that handle dozens of input views. Earlier explicit approaches that directly output millions of Gaussian parameters suffer from blurring in fine structures such as text, while implicit methods achieve better detail by embedding scene information in model weights but require expensive per-frame decompression. Long-LRM++ instead uses a semi-explicit representation decoded by a lightweight network, retaining the quality of full implicit models such as LaCT yet reaching 14 FPS on an A100 GPU. The design also extends to 64 input views at 950 by 540 resolution and improves novel-view depth accuracy over direct Gaussian depth rendering.

Core claim

Long-LRM++ adopts a semi-explicit scene representation combined with a lightweight decoder. This design matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU. The approach also scales to 64 input views at the 950 by 540 resolution and delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians.
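
To put the speed figure in concrete terms, the short calculation below converts the quoted 14 FPS at 950 by 540 into a per-frame time budget and a pixel throughput. Only those two numbers come from the summary above; the function names are illustrative.

```python
# Figures taken from the summary above: 14 FPS at 950 x 540 output resolution.
FPS = 14
WIDTH, HEIGHT = 950, 540

def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame, in milliseconds."""
    return 1000.0 / fps

def pixels_per_second(fps: float, width: int, height: int) -> float:
    """Number of output pixels that must be produced each second."""
    return fps * width * height

budget = frame_budget_ms(FPS)
throughput = pixels_per_second(FPS, WIDTH, HEIGHT)
print(f"per-frame budget: {budget:.1f} ms")               # per-frame budget: 71.4 ms
print(f"pixel throughput: {throughput / 1e6:.2f} Mpix/s")  # pixel throughput: 7.18 Mpix/s
```

Everything that runs per rendered frame, including the lightweight decoder, has to fit inside roughly 71 ms for the reported rate to hold.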

What carries the argument

Semi-explicit scene representation decoded by a lightweight network that avoids full transformer-based decompression for each rendered frame.
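
The summary does not describe what the semi-explicit representation or the decoder look like internally. Purely as an illustration of the general idea, the sketch below assumes the scene has already been rendered to a per-pixel feature map and that a small two-layer MLP turns each feature vector into a color. The feature-map assumption, the layer sizes, and the random weights are all hypothetical and are not taken from the paper.

```python
import numpy as np

def tiny_decoder(features, w1, b1, w2, b2):
    """Apply the same 2-layer MLP to every pixel: (H, W, C) features -> (H, W, 3) RGB."""
    hidden = np.maximum(features @ w1 + b1, 0.0)  # ReLU
    logits = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits))          # sigmoid maps values into (0, 1)

# Hypothetical sizes and random weights, chosen only so the sketch runs.
rng = np.random.default_rng(0)
H, W, C, HIDDEN = 54, 95, 16, 32
features = rng.standard_normal((H, W, C))  # stand-in for a rendered feature map
w1, b1 = 0.1 * rng.standard_normal((C, HIDDEN)), np.zeros(HIDDEN)
w2, b2 = 0.1 * rng.standard_normal((HIDDEN, 3)), np.zeros(3)

rgb = tiny_decoder(features, w1, b1, w2, b2)
print(rgb.shape)  # (54, 95, 3)
```

The point of the sketch is the cost structure: the per-frame work is a few small matrix products per pixel rather than a full pass through a deep backbone.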

If this is right

  • Direct prediction of millions of Gaussian parameters can be replaced without sacrificing detail fidelity.
  • Real-time 360-degree scene reconstruction becomes practical from 32 or 64 input views in a single forward pass.
  • Novel-view depth maps can be obtained more accurately than by rendering from explicit Gaussians.
  • The same semi-explicit plus lightweight-decoder pattern can support higher input resolutions or longer view sequences.
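
The depth claim in the list above could be checked with standard depth-error measures. The sketch below computes two widely used ones, absolute relative error and the fraction of pixels within a 1.25 ratio threshold. Whether the paper reports these exact metrics is not stated in the summary, and the toy arrays are made up for the example.

```python
import numpy as np

def depth_metrics(pred, gt, min_depth=1e-3):
    """Return abs-rel error and delta<1.25 accuracy over pixels with valid ground truth."""
    valid = gt > min_depth                        # ignore missing ground-truth pixels
    p = np.clip(pred[valid], min_depth, None)     # guard against division by zero
    g = gt[valid]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    ratio = np.maximum(p / g, g / p)
    delta1 = float(np.mean(ratio < 1.25))
    return {"abs_rel": abs_rel, "delta1": delta1}

# Toy data: 0.0 marks a pixel with no ground-truth depth.
gt = np.array([[1.0, 2.0], [4.0, 0.0]])
pred = np.array([[1.1, 2.0], [6.0, 3.0]])
m = depth_metrics(pred, gt)
print(m)
```

On the toy data, three pixels are valid; the relative errors are 0.1, 0, and 0.5, so the absolute relative error is 0.2 and two of three pixels pass the 1.25 threshold.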

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may reduce sensitivity to input view ordering or camera calibration errors compared with purely explicit methods.
  • Real-time capability could enable online reconstruction pipelines in robotics or mobile AR where latency matters.
  • The design suggests that many implicit decompression steps can be pre-computed into a compact explicit layer without quality loss.

Load-bearing premise

A lightweight decoder on top of a semi-explicit representation can preserve fine details as effectively as full transformer-based implicit decompression without introducing new blurring artifacts.

What would settle it

A side-by-side rendering test on DL3DV scenes containing text or fine structures where Long-LRM++ shows measurable blurring or loss of sharpness that LaCT does not.
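
One simple way to make "measurable blurring" concrete is to compare a sharpness score on matching crops from the two methods. The sketch below uses the variance of a discrete Laplacian, a common focus measure; it is an assumed choice, not a protocol from the paper, and the checkerboard and box blur only stand in for real renders.

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian; higher means more high-frequency content."""
    lap = (gray[:-2, 1:-1] + gray[2:, 1:-1] + gray[1:-1, :-2] + gray[1:-1, 2:]
           - 4.0 * gray[1:-1, 1:-1])
    return float(lap.var())

def box_blur(gray):
    """3x3 mean filter, used here only to fabricate a 'blurry render' for the demo."""
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    return sum(padded[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

# Synthetic stand-in for a text-like crop: a checkerboard with 4-pixel cells.
yy, xx = np.indices((64, 64))
sharp = ((yy // 4 + xx // 4) % 2).astype(np.float64)
blurry = box_blur(sharp)

print("sharp :", laplacian_variance(sharp))
print("blurry:", laplacian_variance(blurry))
```

In a real comparison the two inputs would be grayscale crops of the same region (for example a sign with text) rendered by each method at the same viewpoint, with a consistently lower score for one method indicating loss of fine detail.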

Figures

Figures reproduced from arXiv: 2512.10267 by Chen Ziwen, Hao Tan, Li Fuxin, Peng Wang, Zexiang Xu.

Figure 1: We present Long-LRM++, a feed-forward novel-view [...]
Figure 2: Overview of the Long-LRM++ architecture. Long-LRM++ takes up to 64 input images at 950 [...]
Figure 3: Qualitative comparison of novel-view rendering on DL3DV (32-input, 960 [...]
Figure 4: Qualitative comparison of novel-view color and depth rendering on ScanNetv2 (128-input, 448 [...]
Original abstract

Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at $950\times540$ resolution, achieving 360° scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians, and decoding RGB frames using the full transformer or TTT backbone. However, this computationally intensive decompression process for every rendered frame makes real-time rendering infeasible. These observations raise key questions: Is the deep, sequential "decompression" process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU, overcoming the speed limitations of prior implicit methods. Our design also scales to 64 input views at the $950\times540$ resolution, demonstrating strong generalization to increased input lengths. Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component in the proposed framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Long-LRM++, which extends prior feed-forward Gaussian splatting work to handle up to 64 input views at 950x540 resolution for 360-degree scene reconstruction. It replaces direct prediction of millions of Gaussian parameters with a semi-explicit scene representation decoded by a lightweight network, aiming to reduce blurring in fine structures while retaining the speed advantages of explicit methods. The central claims are that Long-LRM++ matches LaCT rendering quality on DL3DV, runs at real-time 14 FPS on A100, scales to longer input sequences, and yields superior novel-view depth estimates on ScanNetv2, supported by ablations.

Significance. If the quality equivalence is substantiated, the work offers a practical advance by reconciling the high fidelity of implicit transformer-based decompression with real-time explicit rendering, which is valuable for applications requiring both wide-coverage reconstruction and interactive speeds. The scaling behavior to 64 views and the depth prediction improvement are concrete strengths that could influence follow-on research in generalizable novel-view synthesis.

major comments (2)
  1. [§4.1 and Table 2] §4.1 and Table 2: the headline claim that Long-LRM++ matches LaCT quality on DL3DV rests on aggregate PSNR/SSIM/LPIPS scores, yet no per-scene high-frequency breakdown (e.g., text or edge regions) or frequency-aware metrics are reported; without these, it remains unclear whether the lightweight decoder fully mitigates the blurring the introduction attributes to explicit Gaussian prediction.
  2. [§3.3 Decoder Architecture] §3.3 Decoder Architecture: the semi-explicit representation plus lightweight decoder is presented as capacity-efficient, but the manuscript provides no parameter count comparison to LaCT's full transformer backbone or ablation isolating decoder depth versus fidelity; this leaves open the possibility that observed quality parity is dataset-specific rather than generally preserved.
minor comments (2)
  1. [Figure 3] Figure 3: the qualitative comparisons would be more convincing with zoomed insets and corresponding error maps focused on fine structures such as signage or foliage.
  2. [§3.2] The notation for the semi-explicit feature volume is introduced without an explicit equation linking it to the subsequent decoder input; adding a short equation in §3.2 would improve clarity.
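
The first major comment asks for metrics that isolate high-frequency regions. A minimal sketch of one such measure is given below: PSNR restricted to the pixels with the strongest image gradients in the reference. The masking rule and its quantile are assumptions made for illustration, not a metric from the manuscript, and the step-edge image is synthetic.

```python
import numpy as np

def psnr(pred, target, mask=None, max_val=1.0):
    """PSNR in dB, optionally computed only over pixels where mask is True."""
    err = (pred - target) ** 2
    if mask is not None:
        err = err[mask]
    if err.size == 0:
        raise ValueError("mask selects no pixels")
    mse = float(err.mean())
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def edge_mask(target, quantile=0.9):
    """Select pixels whose gradient magnitude exceeds the given quantile."""
    gy, gx = np.gradient(target)
    mag = np.hypot(gx, gy)
    return mag > np.quantile(mag, quantile)

# Synthetic reference: a vertical step edge. The 'render' smears only the edge.
target = np.zeros((32, 32))
target[:, 16:] = 1.0
pred = target.copy()
pred[:, 15:17] = 0.5

full = psnr(pred, target)
edges = psnr(pred, target, mask=edge_mask(target))
print(f"full-image PSNR: {full:.2f} dB")   # full-image PSNR: 18.06 dB
print(f"edge-only PSNR : {edges:.2f} dB")  # edge-only PSNR : 6.02 dB
```

The gap between the two numbers shows how an aggregate score can look acceptable while the error is concentrated exactly where fine structure lives, which is the concern the referee raises.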

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments and the opportunity to clarify our contributions. We address the major comments point-by-point below, providing additional context from our experiments and committing to revisions where appropriate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§4.1 and Table 2] §4.1 and Table 2: the headline claim that Long-LRM++ matches LaCT quality on DL3DV rests on aggregate PSNR/SSIM/LPIPS scores, yet no per-scene high-frequency breakdown (e.g., text or edge regions) or frequency-aware metrics are reported; without these, it remains unclear whether the lightweight decoder fully mitigates the blurring the introduction attributes to explicit Gaussian prediction.

    Authors: We acknowledge that our primary quantitative comparison in §4.1 and Table 2 relies on aggregate metrics across the DL3DV dataset. To support the claim of matching LaCT quality while preserving fine details, the manuscript includes qualitative results in Figure 3 and the appendix, demonstrating reduced blurring in high-frequency elements such as text and edges. LPIPS, being a perceptual metric, is particularly sensitive to such details. We agree that a more granular analysis would be beneficial and will include a per-scene breakdown for high-frequency regions (e.g., selecting scenes with prominent text) in the revised version, along with any additional frequency-domain metrics if feasible. This will further substantiate that the semi-explicit representation with lightweight decoder effectively mitigates the blurring issue. revision: yes

  2. Referee: [§3.3 Decoder Architecture] §3.3 Decoder Architecture: the semi-explicit representation plus lightweight decoder is presented as capacity-efficient, but the manuscript provides no parameter count comparison to LaCT's full transformer backbone or ablation isolating decoder depth versus fidelity; this leaves open the possibility that observed quality parity is dataset-specific rather than generally preserved.

    Authors: We appreciate this observation regarding the decoder architecture in §3.3. While the manuscript emphasizes the lightweight nature of the decoder for real-time performance, we did not include explicit parameter counts or depth ablations. In the revision, we will add a comparison of parameter counts between our lightweight decoder and LaCT's full transformer backbone. Furthermore, we will conduct and report an ablation study varying the decoder depth to show its impact on fidelity, confirming that the chosen lightweight configuration achieves quality parity without requiring the full capacity of implicit methods. This will help demonstrate the generalizability of our approach beyond the specific datasets used. revision: yes
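
The parameter-count comparison promised in the second response comes down to simple arithmetic over layer shapes. The sketch below shows that arithmetic for a small MLP and for a generic transformer stack. All widths and depths are hypothetical placeholders; they are not the actual configurations of Long-LRM++ or LaCT, which the summary does not give.

```python
def linear_params(d_in, d_out, bias=True):
    """Weights (plus optional biases) of one fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)

def mlp_params(dims):
    """Parameters of a plain MLP with layer widths dims[0] -> dims[1] -> ..."""
    return sum(linear_params(a, b) for a, b in zip(dims, dims[1:]))

def transformer_block_params(d_model, mlp_ratio=4):
    """One standard block: QKV and output projections, 2-layer MLP, two LayerNorms."""
    attn = linear_params(d_model, 3 * d_model) + linear_params(d_model, d_model)
    hidden = mlp_ratio * d_model
    mlp = linear_params(d_model, hidden) + linear_params(hidden, d_model)
    norms = 2 * 2 * d_model  # scale and shift for each of the two norms
    return attn + mlp + norms

# Hypothetical sizes, NOT taken from Long-LRM++ or LaCT.
small_decoder = mlp_params([32, 64, 64, 3])
one_block = transformer_block_params(1024)
deep_stack = 24 * one_block

print(f"small MLP decoder : {small_decoder:,}")  # small MLP decoder : 6,467
print(f"one block (d=1024): {one_block:,}")      # one block (d=1024): 12,596,224
print(f"24-block stack    : {deep_stack:,}")     # 24-block stack    : 302,309,376
```

A table built this way from the real layer shapes, alongside the decoder-depth ablation, would let readers judge how much capacity the per-frame path actually needs.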

Circularity Check

0 steps flagged

Minor self-citation to Long-LRM but central quality/speed claims rest on external comparisons

full rationale

The paper introduces Long-LRM++ as a semi-explicit representation plus lightweight decoder and reports empirical matches to LaCT rendering quality on DL3DV plus 14 FPS real-time performance. These results are obtained via direct benchmarking against external prior methods (LaCT, LVSM) on standard datasets rather than any reduction of the reported gains to quantities defined by the authors' own fitted parameters or self-citations. The only self-reference is the contextual mention of the prior Long-LRM work, which is not load-bearing for the new claims about detail preservation or speed. No equations, uniqueness theorems, or ansatzes are shown to collapse the core assertions back to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond standard neural network components; all details are deferred to the full manuscript.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fast Spatial Memory with Elastic Test-Time Training

    cs.CV 2026-04 unverdicted novelty 6.0

    Elastic Test-Time Training stabilizes test-time updates via an elastic prior and moving-average anchor, enabling Fast Spatial Memory for scalable long-sequence 4D reconstruction with reduced memory use and fewer shortcuts.

Supplementary material (excerpt)

More implementation details. Due to its semi-explicit formulation, Long-LRM++ exhibits a stronger tendency to overfit to input frames when training on mixed sets of input and unseen target frames. This effect becomes more pronounced on datasets such as DL3DV, where neighboring frames have relatively large pose differences, that is, the effective frame ...