pith. machine review for the scientific record.

arxiv: 2605.05749 · v2 · submitted 2026-05-07 · 💻 cs.CV

Recognition: no theorem link

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 07:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords: ray-aware pointer memory · streaming 3D reconstruction · adaptive memory updates · long-term stability · camera pose accuracy · online reconstruction · loop detection · retain-or-replace strategy

The pith

Ray-aware pointers that store both 3D position and viewing direction enable selective retain-or-replace updates for stable streaming 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses dense 3D reconstruction from continuous image streams by focusing on memory management rather than appearance similarity alone. It introduces pointers that each record a 3D location, the ray direction from which that point was observed, and a feature embedding. This representation supports joint checks on geometric closeness and viewpoint agreement to decide whether each pointer is kept or discarded. The resulting retain-or-replace rule avoids averaging observations and instead preserves distinctive geometry while bounding memory size. When the same checks flag potential loop revisits, pose refinement is applied to maintain global consistency across the growing map.

Core claim

Each memory pointer stores its 3D position, associated ray direction, and feature embedding. An adaptive retain-or-replace strategy then decides updates by jointly evaluating spatial distance and ray-direction discrepancy, replacing fusion-based compression. This unified test distinguishes local redundancy, novel observations, and loop candidates, triggering pose refinement on detected loops to enforce global consistency while keeping memory growth bounded and inference streaming-efficient.

What carries the argument

Ray-aware pointer memory, where each pointer encodes 3D position, ray direction, and feature embedding to support joint spatial-directional reasoning in the retain-or-replace update rule.
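The pointer representation and the joint spatial-directional test can be sketched as follows. This is a minimal illustration in the spirit of the paper's description, not its implementation: the field names, thresholds, and the way the two measurements are combined are our own assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class Pointer:
    """One memory entry: 3D position, unit viewing ray, feature embedding."""
    position: tuple   # (x, y, z)
    ray: tuple        # unit direction from camera toward the point
    feature: tuple    # appearance embedding

def dist(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def angle(u, v):
    """Angle in radians between two unit vectors."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    return math.acos(dot)

def classify(stored: Pointer, incoming: Pointer,
             d_near=0.05, a_near=0.3) -> str:
    """Joint test on spatial distance and ray-direction discrepancy
    (illustrative thresholds in metres and radians):
      - close in space AND in direction -> local redundancy: replace, never average
      - close in space, new direction   -> potential loop revisit
      - far in space                    -> novel observation: retain both
    """
    d = dist(stored.position, incoming.position)
    a = angle(stored.ray, incoming.ray)
    if d < d_near and a < a_near:
        return "redundant"
    if d < d_near:
        return "loop-candidate"
    return "novel"
```

In this sketch a "loop-candidate" outcome is what would trigger the pose-refinement step the abstract describes; how the paper actually weighs the two terms is not specified in the text available here.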

If this is right

  • Redundant observations are discarded rather than averaged, preserving sharp geometric features over long sequences.
  • Pose refinement is triggered only on loop candidates identified by the same distance-and-direction test, reducing cumulative error.
  • Memory size remains bounded because each pointer is either retained or replaced instead of merged.
  • Streaming inference stays efficient since no full fusion computation is performed at every step.
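The bounded-memory bullet can be illustrated with a toy store that never merges. This simplification keys on position only (the paper's rule also uses ray direction); the grid quantization and cell size are our own assumptions.

```python
def voxel_key(position, cell=0.05):
    """Quantize a 3D position to a coarse grid cell (illustrative)."""
    return tuple(int(c // cell) for c in position)

def stream_update(store: dict, position, payload, cell=0.05):
    """Retain-or-replace: a redundant observation of an occupied cell
    overwrites the old pointer instead of being averaged in, so
    len(store) is bounded by the number of cells the camera sees."""
    store[voxel_key(position, cell)] = payload

store = {}
# Revisit the same three spots many times: memory stays at 3 entries
# no matter how long the stream runs.
for t in range(1000):
    for spot in [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]:
        stream_update(store, spot, payload=t)
assert len(store) == 3
```

Each update is a single dictionary write, which is also why streaming inference avoids any full fusion pass over the map.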

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pointer representation could be tested in outdoor scenes with large illumination changes to measure whether directional information adds robustness beyond indoor controlled lighting.
  • If the distinction between redundancy and novelty holds, separate loop-closure modules might become unnecessary in some reconstruction pipelines.
  • Extending the retain-or-replace rule to include surface-normal consistency could be checked on datasets with thin structures such as poles or wires.

Load-bearing premise

Joint checks on spatial distance and ray-direction discrepancy can correctly separate redundant observations, new data, and loop revisits without any extra detection steps.

What would settle it

A camera trajectory containing viewpoint shifts that produce similar ray directions for truly distinct surfaces, where the retain-or-replace rule would incorrectly discard unique structure and produce measurable drift in the output mesh.
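The hypothesized failure can be made concrete with a toy case: two truly distinct surfaces that sit close together and are seen along nearly the same ray. The score below is our own simplified combination of the two measurements, with illustrative thresholds; it is meant only to show why such a trajectory would stress the rule.

```python
import math

def discrepancy(p1, r1, p2, r2):
    """Return (spatial distance, angular gap in radians) for two
    observations; a simplified stand-in for the paper's joint metric."""
    d = math.dist(p1, p2)
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(r1, r2))))
    return d, math.acos(dot)

# A wall and a thin pole 3 cm in front of it, both observed along
# essentially the same ray from a distant camera.
wall = ((0.0, 0.0, 2.00), (0.0, 0.0, 1.0))
pole = ((0.0, 0.0, 1.97), (0.0, 0.0, 1.0))

d, a = discrepancy(*wall, *pole)
# With thresholds of 5 cm / 0.3 rad, both checks pass, so a naive
# retain-or-replace rule would call this "redundant" and keep only
# one of the two surfaces, i.e. discard unique structure.
assert d < 0.05 and a < 0.3
```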

Figures

Figures reproduced from arXiv: 2605.05749 by Chi Zhang, Feifei Li, Qi Song, Rui Huang.

Figure 1. Comparison of visualized results of Point3R, our proposed method, and Pseudo GT.
Figure 2. Overview of the proposed ray-aware pointer-based streaming reconstruction pipeline.
Figure 3. Illustration of the pointer update results for a given frame.
Figure 4. Visualized results of reconstruction on datasets NRGBD and 7scenes.
Figure 5. Reserved memory used by the merged method.
read the original abstract

Dense 3D reconstruction from continuous image streams requires both accurate geometric aggregation and stable long-term memory management. Recent feed-forward reconstruction frameworks integrate observations through persistent memory representations, yet most rely primarily on appearance-based similarity when updating memory. Such appearance-driven integration often leads to redundant accumulation of observations and unstable geometry when viewpoint changes occur. In this work, we propose a ray-aware pointer memory for streaming 3D reconstruction that explicitly models both spatial location and viewing direction within a unified memory representation. Each memory pointer stores its 3D position, associated ray direction, and feature embedding, allowing the system to reason jointly about geometric proximity and viewpoint consistency. Based on this representation, we introduce an adaptive pointer update strategy that replaces traditional fusion-based memory compression with a retain-or-replace mechanism. Instead of averaging nearby observations, the system selectively retains informative pointers while discarding redundant ones, preserving distinctive geometric structures while maintaining bounded memory growth. Furthermore, the joint reasoning over spatial distance and ray-direction discrepancy enables the system to distinguish between local redundancy, novel observations, and potential loop revisits in a unified manner. When loop candidates are detected, pose refinement is triggered to enforce global geometric consistency across the reconstruction. Extensive experiments demonstrate that the proposed ray-aware memory design significantly improves long-term reconstruction stability and camera pose accuracy while maintaining efficient streaming inference. Our approach provides a principled framework for scalable and drift-resistant online 3D reconstruction from image streams.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a ray-aware pointer memory for streaming 3D reconstruction from continuous image streams. Each memory pointer stores a 3D position, associated ray direction, and feature embedding. An adaptive retain-or-replace update strategy replaces fusion-based compression and uses a joint metric over spatial distance and ray-direction discrepancy to distinguish local redundancy, novel observations, and potential loop revisits; detected loops trigger pose refinement for global consistency. The authors claim this yields improved long-term reconstruction stability and camera pose accuracy with bounded memory and efficient inference, supported by extensive experiments.

Significance. If the joint geometric+directional scoring and retain-or-replace policy prove reliable, the approach offers a principled alternative to appearance-driven memory management in online reconstruction, potentially reducing drift accumulation and fusion artifacts in long sequences. The bounded-memory design and unified handling of redundancy/novelty/loops address practical streaming constraints. However, the absence of concrete quantitative results, baselines, or robustness tests in the abstract makes the practical significance difficult to evaluate at present.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'extensive experiments demonstrate that the proposed ray-aware memory design significantly improves long-term reconstruction stability and camera pose accuracy' is unsupported by any reported metrics, error bars, dataset names, baseline comparisons, or ablation results, which directly undermines verification of the load-bearing performance assertions.
  2. [Method (ray-aware pointer memory and adaptive updates)] Method description of ray-aware pointer memory: the unified distance+direction discrepancy metric is presented as sufficient to separate redundancy, novelty, and loop revisits and to trigger refinement, yet the text provides no analysis or safeguards against accumulating pose drift corrupting ray-direction estimates; this is load-bearing for the stability and accuracy claims because noisy discrepancy signals could cause either loss of useful loop information or spurious refinements.
minor comments (1)
  1. [Abstract] Abstract: consider adding one sentence summarizing the evaluation datasets and key quantitative gains (e.g., pose error reduction or stability metric) to make the contribution more concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the abstract and method description, and we will revise the manuscript accordingly to improve clarity and verifiability.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'extensive experiments demonstrate that the proposed ray-aware memory design significantly improves long-term reconstruction stability and camera pose accuracy' is unsupported by any reported metrics, error bars, dataset names, baseline comparisons, or ablation results, which directly undermines verification of the load-bearing performance assertions.

    Authors: We agree that the abstract should include concrete quantitative support. In the revised version we will expand the final sentence to report specific metrics (e.g., mean reconstruction error and pose RMSE reductions on ScanNet and TUM-RGBD), error bars from repeated runs, named baselines, and reference to the ablation studies already present in the experimental section. This change will make the performance claims directly verifiable from the abstract. revision: yes

  2. Referee: [Method (ray-aware pointer memory and adaptive updates)] Method description of ray-aware pointer memory: the unified distance+direction discrepancy metric is presented as sufficient to separate redundancy, novelty, and loop revisits and to trigger refinement, yet the text provides no analysis or safeguards against accumulating pose drift corrupting ray-direction estimates; this is load-bearing for the stability and accuracy claims because noisy discrepancy signals could cause either loss of useful loop information or spurious refinements.

    Authors: We acknowledge the importance of this robustness consideration. Ray directions are recorded at observation time and are updated during the global pose refinement step that is triggered on loop detection; the discrepancy metric therefore operates on the refined poses for revisited regions. To make this explicit, we will add a short subsection discussing the effect of residual drift on the joint metric, introduce an uncertainty-weighted variant of the discrepancy score as a safeguard, and include a targeted experiment that injects controlled pose noise to quantify sensitivity. These additions will directly address the load-bearing concern. revision: yes
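The safeguard the rebuttal proposes could take a form like the following. Both the weighting scheme and the noise-injection experiment are hypothetical sketches of what the authors describe, not code from the paper.

```python
import math
import random

def weighted_direction_gap(r1, r2, sigma):
    """Down-weight the angular term when pose uncertainty sigma
    (expected ray-direction error, radians) is large, so drifted rays
    contribute a weaker directional vote and are less likely to trigger
    spurious loop refinements (hypothetical uncertainty weighting)."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(r1, r2))))
    gap = math.acos(dot)
    return gap / (1.0 + sigma)

def perturb(ray, noise, rng):
    """Inject controlled direction noise, in the spirit of the
    proposed sensitivity experiment: jitter each component, then
    renormalize to a unit vector."""
    v = [c + rng.gauss(0.0, noise) for c in ray]
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

rng = random.Random(0)
clean = (0.0, 0.0, 1.0)
for noise in (0.0, 0.05, 0.2):
    noisy = perturb(clean, noise, rng)
    score = weighted_direction_gap(clean, noisy, sigma=noise)
```

Sweeping the injected noise level and plotting the resulting scores against loop-detection outcomes is one way to quantify the sensitivity the referee asks about.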

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents a proposed system architecture (ray-aware pointer memory with retain-or-replace updates and joint spatial+direction scoring) whose correctness is asserted via design description and external experiments rather than any mathematical derivation that reduces to its own inputs. No equations, fitted parameters renamed as predictions, self-citation load-bearing uniqueness theorems, or ansatz smuggling appear in the provided text. The central claim that the unified metric distinguishes redundancy/novelty/loop revisits is introduced as a novel mechanism, not derived from prior results by the same authors. This is the common honest non-finding for a systems paper whose contributions are algorithmic and empirical.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper's contribution centers on this new memory structure and update rule, which are postulated without external validation in the abstract.

axioms (1)
  • domain assumption: 3D geometry and viewing directions can be jointly used to manage memory updates
    Fundamental to the ray-aware design
invented entities (1)
  • ray-aware pointer memory · no independent evidence
    purpose: To store and update 3D observations with spatial and directional information
    Core new concept introduced for the reconstruction system

pith-pipeline@v0.9.0 · 5556 in / 1160 out tokens · 64142 ms · 2026-05-13T07:43:34.486441+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages

  1. [1]

    Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. 2011. Building Rome in a day. Commun. ACM 54, 10 (2011), 105–112.

  2. [2]

    Sameer Agarwal, Noah Snavely, Steven M Seitz, and Richard Szeliski. 2010. Bundle adjustment in the large. In European Conference on Computer Vision. Springer, 29–42.

  3. [3]

    Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. 2022. Neural RGB-D surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6290–6301.

  4. [4]

    Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. 2012. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision. Springer, 611–625.

  5. [5]

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. 2025. TTT3R: 3D reconstruction as test-time training. arXiv preprint arXiv:2509.26645 (2025).

  6. [6]

    Zhuoguang Chen, Minghui Qin, Tianyuan Yuan, Zhe Liu, and Hang Zhao. Long3R: Long sequence streaming 3D reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5273–5284.

  8. [8]

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. 2017. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5828–5839.

  9. [9]

    Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. 2025. VGGT-Long: Chunk it, Loop it, Align it – Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences. arXiv preprint arXiv:2507.16443 (2025).

  10. [10]

    Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. 2019. D2-Net: A trainable CNN for joint description and detection of local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8092–8101.

  11. [11]

    Qiancheng Fu, Qingshan Xu, Yew Soon Ong, and Wenbing Tao. 2022. Geo-Neus: Geometry-consistent neural implicit surfaces learning for multi-view reconstruction. Advances in Neural Information Processing Systems 35 (2022), 3403–3416.

  12. [12]

    Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. 2013. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32, 11 (2013), 1231–1237.

  13. [13]

    Wen Jiang, Boshu Lei, and Kostas Daniilidis. 2024. FisherRF: Active view selection and mapping with radiance fields using Fisher information. In European Conference on Computer Vision. Springer, 422–440.

  14. [14]

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 2023. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42, 4 (2023), Article 139.

  16. [16]

    Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. 2021. Robust consistent video depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1611–1621.

  17. [17]

    Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan. 2025. Stream3R: Scalable sequential 3D reconstruction with causal transformer. arXiv preprint arXiv:2508.10893 (2025).

  18. [18]

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. 2024. Grounding image matching in 3D with MASt3R. In European Conference on Computer Vision. Springer, 71–91.

  19. [19]

    Feifei Li, Panwen Hu, Qi Song, and Rui Huang. 2024. Incremental 3D reconstruction through a hybrid explicit-and-implicit representation. In 2024 IEEE International Conference on Robotics and Automation (ICRA). 15121–15127. doi:10.1109/ICRA57147.2024.10610868

  20. [20]

    Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. 2023. LightGlue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 17627–17638.

  21. [21]

    David G Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2 (2004), 91–110.

  22. [22]

    Dominic Maggio and Luca Carlone. 2026. VGGT-SLAM 2.0: Real time Dense Feed-forward Scene Reconstruction. arXiv preprint arXiv:2601.19887 (2026).

  23. [23]

    Dominic Maggio, Hyungtae Lim, and Luca Carlone. 2025. VGGT-SLAM: Dense RGB SLAM optimized on the SL(4) manifold. arXiv preprint arXiv:2505.12549 (2025).

  24. [24]

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2021. NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (2021), 99–106.

  25. [25]

    Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguere, and Cyrill Stachniss. 2019. ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 7855–7862.

  26. [26]

    Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision. IEEE, 2564–2571.

  27. [27]

    Johannes L Schonberger and Jan-Michael Frahm. 2016. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4104–4113.

  28. [28]

    Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision. Springer, 501–518.

  30. [30]

    Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. 2013. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2930–2937.

  31. [31]

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision. Springer, 746–760.

  32. [32]

    Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. 2012. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 573–580.

  33. [33]

    Chris Sweeney, Torsten Sattler, Tobias Hollerer, Matthew Turk, and Marc Pollefeys. 2015. Optimizing the viewing graph for structure-from-motion. In Proceedings of the IEEE International Conference on Computer Vision. 801–809.

  34. [34]

    Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment – a modern synthesis. In International Workshop on Vision Algorithms. Springer, 298–372.

  36. [36]

    Hengyi Wang and Lourdes Agapito. 2025. 3D reconstruction with spatial memory. In 2025 International Conference on 3D Vision (3DV). IEEE, 78–89.

  37. [37]

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. 2025. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference. 5294–5306.

  38. [38]

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. 2025. Continuous 3D perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference. 10510–10522.

  39. [39]

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. 2024. DUSt3R: Geometric 3D vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 20697–20709.

  40. [40]

    Yi Wei, Shaohui Liu, Yongming Rao, Wang Zhao, Jiwen Lu, and Jie Zhou. 2021. NerfingMVS: Guided optimization of neural radiance fields for indoor multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5610–5619.

  41. [41]

    Changchang Wu. 2013. Towards linear-time incremental structure from motion. In 2013 International Conference on 3D Vision – 3DV 2013. IEEE, 127–134.

  42. [42]

    Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. 2025. Point3R: Streaming 3D reconstruction with explicit spatial pointer memory. arXiv preprint arXiv:2507.02863 (2025).

  43. [43]

    Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. 2025. Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference. 21924–21935.

  44. [44]

    Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. 2026. InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams. arXiv preprint arXiv:2601.02281 (2026).

  45. [45]

    Chi Zhang, Qi Song, Feifei Li, Jie Li, and Rui Huang. 2025. Improving Hierarchical Representations of Vectorized HD Maps with Perspective Clues. IEEE Robotics and Automation Letters (2025).

  46. [46]

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. 2024. MonST3R: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825 (2024).

  47. [47]

    Zhoutong Zhang, Forrester Cole, Zhengqi Li, Michael Rubinstein, Noah Snavely, and William T Freeman. 2022. Structure and motion from casual videos. In European Conference on Computer Vision. Springer, 20–37.