Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction
Pith reviewed 2026-05-16 23:52 UTC · model grok-4.3
The pith
A semi-explicit scene representation with a lightweight decoder preserves fine details in feed-forward wide-coverage reconstruction while delivering real-time performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Long-LRM++ adopts a semi-explicit scene representation combined with a lightweight decoder. This design matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU. The approach also scales to 64 input views at the 950 by 540 resolution and delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians.
What carries the argument
Semi-explicit scene representation decoded by a lightweight network that avoids full transformer-based decompression for each rendered frame.
If this is right
- Direct prediction of millions of Gaussian parameters can be replaced without sacrificing detail fidelity.
- Real-time 360-degree scene reconstruction becomes practical from 32 or 64 input views in a single forward pass.
- Novel-view depth maps can be obtained more accurately than by rendering from explicit Gaussians.
- The same semi-explicit plus lightweight-decoder pattern can support higher input resolutions or longer view sequences.
Where Pith is reading between the lines
- The approach may reduce sensitivity to input view ordering or camera calibration errors compared with purely explicit methods.
- Real-time capability could enable online reconstruction pipelines in robotics or mobile AR where latency matters.
- The design suggests that many implicit decompression steps can be pre-computed into a compact explicit layer without quality loss.
Load-bearing premise
A lightweight decoder on top of a semi-explicit representation can preserve fine details as effectively as full transformer-based implicit decompression without introducing new blurring artifacts.
What would settle it
A side-by-side rendering test on DL3DV scenes containing text or fine structures where Long-LRM++ shows measurable blurring or loss of sharpness that LaCT does not.
Figures
read the original abstract
Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at $950\times540$ resolution, achieving 360{\deg} scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians, and decoding RGB frames using the full transformer or TTT backbone. However, this computationally intensive decompression process for every rendered frame makes real-time rendering infeasible. These observations raise key questions: Is the deep, sequential "decompression" process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU, overcoming the speed limitations of prior implicit methods. Our design also scales to 64 input views at the $950\times540$ resolution, demonstrating strong generalization to increased input lengths. Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component in the proposed framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Long-LRM++, which extends prior feed-forward Gaussian splatting work to handle up to 64 input views at 950x540 resolution for 360-degree scene reconstruction. It replaces direct prediction of millions of Gaussian parameters with a semi-explicit scene representation decoded by a lightweight network, aiming to reduce blurring in fine structures while retaining the speed advantages of explicit methods. The central claims are that Long-LRM++ matches LaCT rendering quality on DL3DV, runs at real-time 14 FPS on A100, scales to longer input sequences, and yields superior novel-view depth estimates on ScanNetv2, supported by ablations.
Significance. If the quality equivalence is substantiated, the work offers a practical advance by reconciling the high fidelity of implicit transformer-based decompression with real-time explicit rendering, which is valuable for applications requiring both wide-coverage reconstruction and interactive speeds. The scaling behavior to 64 views and the depth prediction improvement are concrete strengths that could influence follow-on research in generalizable novel-view synthesis.
major comments (2)
- [§4.1 and Table 2] §4.1 and Table 2: the headline claim that Long-LRM++ matches LaCT quality on DL3DV rests on aggregate PSNR/SSIM/LPIPS scores, yet no per-scene high-frequency breakdown (e.g., text or edge regions) or frequency-aware metrics are reported; without these, it remains unclear whether the lightweight decoder fully mitigates the blurring the introduction attributes to explicit Gaussian prediction.
- [§3.3 Decoder Architecture] §3.3 Decoder Architecture: the semi-explicit representation plus lightweight decoder is presented as capacity-efficient, but the manuscript provides no parameter count comparison to LaCT's full transformer backbone or ablation isolating decoder depth versus fidelity; this leaves open the possibility that observed quality parity is dataset-specific rather than generally preserved.
minor comments (2)
- [Figure 3] Figure 3: the qualitative comparisons would be more convincing with zoomed insets and corresponding error maps focused on fine structures such as signage or foliage.
- [§3.2] The notation for the semi-explicit feature volume is introduced without an explicit equation linking it to the subsequent decoder input; adding a short equation in §3.2 would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments and the opportunity to clarify our contributions. We address the major comments point-by-point below, providing additional context from our experiments and committing to revisions where appropriate to strengthen the manuscript.
read point-by-point responses
-
Referee: [§4.1 and Table 2] §4.1 and Table 2: the headline claim that Long-LRM++ matches LaCT quality on DL3DV rests on aggregate PSNR/SSIM/LPIPS scores, yet no per-scene high-frequency breakdown (e.g., text or edge regions) or frequency-aware metrics are reported; without these, it remains unclear whether the lightweight decoder fully mitigates the blurring the introduction attributes to explicit Gaussian prediction.
Authors: We acknowledge that our primary quantitative comparison in §4.1 and Table 2 relies on aggregate metrics across the DL3DV dataset. To support the claim of matching LaCT quality while preserving fine details, the manuscript includes qualitative results in Figure 4 and the appendix, demonstrating reduced blurring in high-frequency elements such as text and edges. LPIPS, being a perceptual metric, is particularly sensitive to such details. We agree that a more granular analysis would be beneficial and will include a per-scene breakdown for high-frequency regions (e.g., selecting scenes with prominent text) in the revised version, along with any additional frequency-domain metrics if feasible. This will further substantiate that the semi-explicit representation with lightweight decoder effectively mitigates the blurring issue. revision: yes
-
Referee: [§3.3 Decoder Architecture] §3.3 Decoder Architecture: the semi-explicit representation plus lightweight decoder is presented as capacity-efficient, but the manuscript provides no parameter count comparison to LaCT's full transformer backbone or ablation isolating decoder depth versus fidelity; this leaves open the possibility that observed quality parity is dataset-specific rather than generally preserved.
Authors: We appreciate this observation regarding the decoder architecture in §3.3. While the manuscript emphasizes the lightweight nature of the decoder for real-time performance, we did not include explicit parameter counts or depth ablations. In the revision, we will add a comparison of parameter counts between our lightweight decoder and LaCT's full transformer backbone. Furthermore, we will conduct and report an ablation study varying the decoder depth to show its impact on fidelity, confirming that the chosen lightweight configuration achieves quality parity without requiring the full capacity of implicit methods. This will help demonstrate the generalizability of our approach beyond the specific datasets used. revision: yes
Circularity Check
Minor self-citation to Long-LRM but central quality/speed claims rest on external comparisons
full rationale
The paper introduces Long-LRM++ as a semi-explicit representation plus lightweight decoder and reports empirical matches to LaCT rendering quality on DL3DV plus 14 FPS real-time performance. These results are obtained via direct benchmarking against external prior methods (LaCT, LVSM) on standard datasets rather than any reduction of the reported gains to quantities defined by the authors' own fitted parameters or self-citations. The only self-reference is the contextual mention of the prior Long-LRM work, which is not load-bearing for the new claims about detail preservation or speed. No equations, uniqueness theorems, or ansatzes are shown to collapse the core assertions back to the inputs by construction.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
Fast Spatial Memory with Elastic Test-Time Training
Elastic Test-Time Training stabilizes test-time updates via an elastic prior and moving-average anchor, enabling Fast Spatial Memory for scalable long-sequence 4D reconstruction with reduced memory use and fewer shortcuts.
Reference graph
Works this paper leans on
-
[1]
Kara-Ali Aliev, Artem Sevastopolsky, Maria Kolos, Dmitry Ulyanov, and Victor Lempitsky. Neural point-based graphics. InEuropean conference on computer vision, pages 696–712. Springer, 2020. 2
work page 2020
-
[2]
pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction
David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19457–19467, 2024. 1, 2
work page 2024
-
[3]
Mvsnerf: Fast general- izable radiance field reconstruction from multi-view stereo
Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast general- izable radiance field reconstruction from multi-view stereo. InProceedings of the IEEE/CVF international conference on computer vision, pages 14124–14133, 2021. 2
work page 2021
-
[4]
Tensorf: Tensorial radiance fields
Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. InEuropean Conference on Computer Vision (ECCV), 2022. 2
work page 2022
-
[5]
Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images, 2024
Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images.arXiv preprint arXiv:2403.14627, 2024. 1, 2
-
[6]
Scannet: Richly-annotated 3d reconstructions of indoor scenes
Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 2, 5, 6
work page 2017
-
[7]
Plenoxels: Radiance fields without neural networks
Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5501–5510, 2022. 2
work page 2022
-
[8]
Query-key normal- ization for transformers
Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, and Yuxuan Chen. Query-key normalization for transformers. arXiv preprint arXiv:2010.04245, 2020. 5
-
[9]
Lrm: Large reconstruction model for single image to 3d
Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. InThe Twelfth International Conference on Learning Representations, 2024. 2
work page 2024
-
[10]
2d gaussian splatting for geometrically ac- curate radiance fields
Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically ac- curate radiance fields. InACM SIGGRAPH 2024 conference papers, pages 1–11, 2024. 8
work page 2024
-
[11]
Guichen Huang, Ruoyu Wang, Xiangjun Gao, Che Sun, Yuwei Wu, Shenghua Gao, and Yunde Jia. Longsplat: On- line generalizable 3d gaussian splatting from long sequence images.arXiv preprint arXiv:2507.16144, 2025. 2
-
[12]
Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias.arXiv preprint arXiv:2410.17242, 2024. 2, 3, 5
-
[13]
Geonerf: Generalizing nerf with geometry priors
Mohammad Mahdi Johari, Yann Lepoittevin, and Franc ¸ois Fleuret. Geonerf: Generalizing nerf with geometry priors. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18365–18375, 2022. 2
work page 2022
-
[14]
3d gaussian splatting for real-time radiance field rendering.ACM Trans
Bernhard Kerbl, Georgios Kopanas, Thomas Leimk¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023. 1, 5
work page 2023
-
[15]
Spacetime gaus- sian feature splatting for real-time dynamic view synthesis
Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. Spacetime gaus- sian feature splatting for real-time dynamic view synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8508–8520, 2024. 3
work page 2024
-
[16]
Efficient neural radiance fields for interactive free-viewpoint video
Haotong Lin, Sida Peng, Zhen Xu, Yunzhi Yan, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Efficient neural radiance fields for interactive free-viewpoint video. InSIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022. 2
work page 2022
-
[17]
Dl3dv-10k: A large-scale scene dataset for deep learning- based 3d vision
Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning- based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160– 22169, 2024. 2, 5
work page 2024
-
[18]
Neural sparse voxel fields.Advances in Neural Information Processing Systems, 33:15651–15663,
Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields.Advances in Neural Information Processing Systems, 33:15651–15663,
-
[19]
Tianqi Liu, Guangcong Wang, Shoukang Hu, Liao Shen, Xinyi Ye, Yuhang Zang, Zhiguo Cao, Wei Li, and Ziwei Liu. Fast generalizable gaussian splatting reconstruction from multi-view stereo.arXiv preprint arXiv:2405.12218, 2024. 1, 2
-
[20]
Neural rays for occlusion-aware image-based rendering
Yuan Liu, Sida Peng, Lingjie Liu, Qianqian Wang, Peng Wang, Christian Theobalt, Xiaowei Zhou, and Wenping Wang. Neural rays for occlusion-aware image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 7824–7833, 2022. 2
work page 2022
-
[21]
Neural volumes: Learning dynamic renderable volumes from images,
Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. arXiv preprint arXiv:1906.07751, 2019. 2
-
[22]
Scaffold-gs: Structured 3d gaussians for view-adaptive rendering
Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20654–20664, 2024. 2
work page 2024
-
[23]
Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthe- sis with prescriptive sampling guidelines.ACM Transactions on Graphics (ToG), 38(4):1–14, 2019. 2
work page 2019
-
[24]
Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 1, 2
work page 2021
-
[25]
Thomas M¨uller, Alex Evans, Christoph Schied, and Alexan- der Keller. Instant neural graphics primitives with a multires- olution hash encoding.ACM transactions on graphics (TOG), 41(4):1–15, 2022. 2
work page 2022
-
[26]
Feature splatting: Language-driven physics-based scene synthesis and editing,
Ri-Zhao Qiu, Ge Yang, Weijia Zeng, and Xiaolong Wang. Feature splatting: Language-driven physics-based scene syn- thesis and editing.arXiv preprint arXiv:2404.01223, 2024. 3
-
[27]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 3
work page 2021
-
[28]
Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians.arXiv preprint arXiv:2403.17898, 2024. 2
-
[29]
Gernot Riegler and Vladlen Koltun. Stable view synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12216–12225, 2021. 2
work page 2021
-
[30]
Simplere- con: 3d reconstruction without 3d convolutions
Mohamed Sayed, John Gibson, Jamie Watson, Victor Prisacariu, Michael Firman, and Cl´ement Godard. Simplere- con: 3d reconstruction without 3d convolutions. InEuropean Conference on Computer Vision, pages 1–19. Springer, 2022. 4
work page 2022
-
[31]
Pixelwise view selection for unstructured multi-view stereo
Johannes L Sch¨onberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. InComputer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, Octo- ber 11-14, 2016, Proceedings, Part III 14, pages 501–518. Springer, 2016. 5
work page 2016
-
[32]
Deepvox- els: Learning persistent 3d feature embeddings
Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. Deepvox- els: Learning persistent 3d feature embeddings. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2437–2446, 2019. 2
work page 2019
-
[33]
Vincent Sitzmann, Michael Zollh¨ofer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure- aware neural scene representations.Advances in neural infor- mation processing systems, 32, 2019. 2
work page 2019
-
[34]
Generalizable patch-based neural ren- dering
Mohammed Suhail, Carlos Esteves, Leonid Sigal, and Ameesh Makadia. Generalizable patch-based neural ren- dering. InEuropean Conference on Computer Vision, pages 156–174. Springer, 2022. 2
work page 2022
-
[35]
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states.arXiv preprint arXiv:2407.04620, 2024. 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
De- ferred neural rendering: Image synthesis using neural textures
Justus Thies, Michael Zollh¨ofer, and Matthias Nießner. De- ferred neural rendering: Image synthesis using neural textures. Acm Transactions on Graphics (TOG), 38(4):1–12, 2019. 2
work page 2019
-
[37]
Sags: Structure-aware 3d gaussian splatting
Evangelos Ververas, Rolandos Alexandros Potamias, Jifei Song, Jiankang Deng, and Stefanos Zafeiriou. Sags: Structure-aware 3d gaussian splatting. InEuropean Con- ference on Computer Vision, pages 221–238. Springer, 2024. 2
work page 2024
-
[38]
Pf-lrm: Pose-free large reconstruction model for joint pose and shape prediction
Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. Pf-lrm: Pose-free large reconstruction model for joint pose and shape prediction. InThe Twelfth International Conference on Learning Representations, 2023. 2
work page 2023
-
[39]
Ibrnet: Learning multi-view image-based rendering
Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srini- vasan, Howard Zhou, Jonathan T Barron, Ricardo Martin- Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2021. 2
work page 2021
-
[40]
Depth- splat: Connecting gaussian splatting and depth
Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depth- splat: Connecting gaussian splatting and depth. InProceed- ings of the Computer Vision and Pattern Recognition Confer- ence, pages 16453–16463, 2025. 1, 2
work page 2025
-
[41]
Point-nerf: Point- based neural radiance fields
Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point- based neural radiance fields. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5438–5448, 2022. 2
work page 2022
-
[42]
Multi-space neural radiance fields
Ze-Xin Yin, Jiaxiong Qiu, Ming-Ming Cheng, and Bo Ren. Multi-space neural radiance fields. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12407–12416, 2023. 4
work page 2023
-
[43]
pixelnerf: Neural radiance fields from one or few images
Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4578–4587, 2021. 2
work page 2021
-
[44]
Gs-lrm: Large reconstruction model for 3d gaussian splatting.ArXiv, abs/2404.19702, 2024
Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large recon- struction model for 3d gaussian splatting.arXiv preprint arXiv:2404.19702, 2024. 1, 2, 4
-
[45]
Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right.arXiv preprint arXiv:2505.23884, 2025. 2, 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Nerfusion: Fusing radiance fields for large- scale scene reconstruction
Xiaoshuai Zhang, Sai Bi, Kalyan Sunkavalli, Hao Su, and Zexiang Xu. Nerfusion: Fusing radiance fields for large- scale scene reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5449–5458, 2022. 2
work page 2022
-
[47]
Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields
Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Ze- hao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024. 3
work page 2024
-
[48]
Autofocusformer: Image segmentation off the grid
Chen Ziwen, Kaushik Patnaik, Shuangfei Zhai, Alvin Wan, Zhile Ren, Alexander G Schwing, Alex Colburn, and Li Fuxin. Autofocusformer: Image segmentation off the grid. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18227–18236, 2023. 4
work page 2023
-
[49]
Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. arXiv preprint 2410.12781, 2024. 1, 2, 3, 4, 5 Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction Supplementary Material
-
[50]
More implementation details Due to its semi-explicit formulation, Long-LRM++ exhibits a stronger tendency to overfit to input frames when training on mixed sets of input and unseen target frames. This ef- fect becomes more pronounced on datasets such as DL3DV , where neighboring frames have relatively large pose differ- ences—that is, the effective frame ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.