pith. machine review for the scientific record.

arxiv: 2605.06270 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

Haijie Li, Jian Zhang, Jiaqi Zhang, Jiaye Fu, Qiankun Gao, Siwei Ma, Yanmin Wu, Zecheng Tang

Pith reviewed 2026-05-08 13:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D reconstruction · vision transformers · token reduction · feed-forward models · acceleration · token merging · asymmetric compression · pruning

The pith

Asymmetric compression of query versus key-value tokens accelerates feed-forward 3D reconstruction by up to 28× without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Feed-forward 3D reconstruction models based on vision transformers turn small sets of images into scene geometry and camera poses, but attention costs grow quadratically and make long video inputs impractical. The central observation is that query tokens carry view-specific geometric requests and lose quality under heavy compression, while key-value tokens hold shared scene context that survives more aggressive reduction. Spark3R therefore applies different reduction methods and rates to each: intra-group merging for queries and lightweight pruning for key-value tokens, with the pruning rate adjusted automatically across layers. The resulting training-free plug-in works on existing models and processes up to a thousand frames at far lower cost while keeping reconstruction quality close to the uncompressed baseline. Readers would care because this distinction turns an otherwise intractable scaling barrier into a manageable engineering choice.
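
To make the scaling concrete, here is a back-of-the-envelope sketch (not from the paper) of how the cost of the attention score matrix changes when queries and key-value tokens are reduced by separate factors; the token counts, head dimension, and reduction factors below are hypothetical placeholders rather than the paper's settings.

    def attention_score_cost(num_tokens, head_dim, r_q=1, r_kv=1):
        # Multiply-accumulates needed for the Q @ K^T score matrix in one
        # global attention layer, after reducing queries by r_q and
        # key-value tokens by r_kv.
        q_len = num_tokens / r_q
        kv_len = num_tokens / r_kv
        return q_len * kv_len * head_dim

    # Hypothetical example: 1,000 frames x 500 patch tokens per frame.
    n = 1000 * 500
    baseline = attention_score_cost(n, head_dim=64)
    reduced = attention_score_cost(n, head_dim=64, r_q=2, r_kv=16)
    print(baseline / reduced)  # = r_q * r_kv = 32 for the score matrix alone

The score-matrix cost shrinks by roughly r_q·r_kv, which is why large end-to-end speedups become reachable even though other layers and the value products temper the overall gain.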

Core claim

Query tokens encode view-specific geometric requests and remain sensitive to compression, whereas key-value tokens represent shared scene context and tolerate aggressive compression. Spark3R exploits this split by assigning distinct reduction factors, using intra-group token merging on queries and lightweight token pruning on key-value tokens, plus an adaptive schedule that changes the key-value reduction factor layer by layer. The framework requires no retraining and inserts directly into pretrained models such as VGGT, π³, and Depth-Anything-3, producing up to 28× speedup on 1,000-frame inputs while preserving competitive reconstruction quality.
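
A minimal sketch of what such an asymmetric reduction could look like inside one global attention layer, assuming contiguous-group averaging for query merging and a plain temporal stride for key-value pruning; the authors' exact grouping, matching, and un-merging rules may differ, so treat this as an illustration of the idea rather than Spark3R's implementation.

    import torch
    import torch.nn.functional as F

    def asymmetric_reduced_attention(q, k, v, r_q=2, r_kv=8):
        # q, k, v: (batch, tokens, dim); tokens assumed divisible by r_q.
        B, N, D = q.shape
        # Intra-group merging of queries: average each group of r_q tokens.
        q_merged = q.reshape(B, N // r_q, r_q, D).mean(dim=2)  # (B, N/r_q, D)
        # Lightweight pruning of key-value tokens: keep every r_kv-th token.
        k_kept, v_kept = k[:, ::r_kv], v[:, ::r_kv]            # (B, ~N/r_kv, D)
        # Standard attention on the reduced sequences.
        out = F.scaled_dot_product_attention(q_merged, k_kept, v_kept)
        # Un-merge: each original query token reuses its group's output.
        return out.repeat_interleave(r_q, dim=1)               # (B, N, D)

The asymmetry lives in r_kv being allowed to be much larger than r_q, and in queries being merged (so every view's request is still represented) while key-value tokens are simply dropped.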

What carries the argument

Asymmetric token reduction that applies intra-group merging to query tokens and adaptive lightweight pruning to key-value tokens.

If this is right

  • Pretrained models can process video-length inputs with hundreds or thousands of frames at practical speeds.
  • No retraining or architectural changes are needed, so the method works immediately on published checkpoints.
  • Reconstruction quality stays competitive rather than trading off sharply for speed.
  • Layer-wise adaptation of the key-value reduction factor improves the quality-efficiency balance beyond fixed-rate pruning (a schedule of this kind is sketched after this list).
  • The same plug-in pattern applies to multiple distinct feed-forward 3D architectures.
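
On the layer-wise adaptation point, a hypothetical sketch of such a schedule, assuming sensitivity is estimated by pruning one layer at a time on a calibration sequence and measuring the rise in pose error; the threshold, factors, and calibration procedure here are invented for illustration and are not the paper's values.

    def assign_layer_kv_factors(layer_sensitivities, threshold=0.1,
                                r_small=4, r_large=16):
        # Sensitive layers get the gentle factor, tolerant layers the
        # aggressive one (hypothetical criterion, not the paper's).
        return [r_small if s > threshold else r_large
                for s in layer_sensitivities]

    # e.g., per-layer ATE increase measured with that layer's KV pruned:
    factors = assign_layer_kv_factors([0.01, 0.30, 0.02])
    # -> [16, 4, 16]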

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The functional split between query and key-value roles may exist in other vision-transformer tasks that process multi-view or sequential data, suggesting similar asymmetric reductions could be tested there.
  • Future model designs could explicitly separate view-specific and scene-shared pathways to make such acceleration easier to apply by default.
  • Evaluating the method on inputs with strong motion or changing lighting would test whether the tolerance of key-value tokens holds under more variable scene conditions.

Load-bearing premise

Query tokens are always more sensitive to compression than key-value tokens, and this distinction remains reliable across different pretrained models and input lengths without further tuning.

What would settle it

Apply the same aggressive pruning rate to both query and key-value tokens on 1,000-frame sequences and measure whether reconstruction quality falls below the level achieved by the asymmetric method.
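
A sketch of that deciding experiment, with run_reconstruction and pose_ate as stand-ins for whatever model wrapper and evaluation code is on hand; the reduction factors are placeholders.

    def settle_asymmetry(run_reconstruction, pose_ate, frames, r=16):
        # Uniform: compress queries and key-value tokens at the same rate.
        uniform = run_reconstruction(frames, r_q=r, r_kv=r)
        # Asymmetric: spare the queries, compress only the KV side hard.
        asymmetric = run_reconstruction(frames, r_q=2, r_kv=r)
        # The claim survives only if the uniform variant's pose error is
        # clearly worse on long (e.g., 1,000-frame) sequences.
        return pose_ate(uniform), pose_ate(asymmetric)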

Figures

Figures reproduced from arXiv: 2605.06270 by Haijie Li, Jian Zhang, Jiaqi Zhang, Jiaye Fu, Qiankun Gao, Siwei Ma, Yanmin Wu, Zecheng Tang.

Figure 1. Compression sensitivity of different token roles in VGGT. We separately compress query tokens (orange), key-value tokens (blue), and both jointly (red) at increasing reduction factors and report pose error (ATE ↓). Key-value tokens tolerate aggressive compression with negligible quality loss, while query tokens degrade sharply beyond a reduction factor of 12. Joint uniform compression yields the steepest c…

Figure 2. Overview of Spark3R. (Top) Spark3R applies asymmetric token reduction to the global attention layers of a feed-forward 3D reconstruction model, with separate reduction factors rQ and rKV (rKV > rQ in general). (Middle) A layer-adaptive key-value reduction schedule assigns each layer a large or small rKV based on its measured sensitivity to compression. (Bottom) Detailed illustration of the asymmetric reduc…

Figure 3. Distribution of inter-frame distances between merged source–…

Figure 4. Distribution of cosine similarities between matched source–destination…

Figure 6. Per-layer sensitivity to key-value pruning in…

Figure 7. Qualitative comparison with unaccelerated base models. Each pair shows the original model and its Spark3R-accelerated counterpart. Spark3R preserves fine-grained geometric details and produces point clouds visually comparable to the unaccelerated baselines. Notably, for VGGT, Spark3R even improves the reconstruction quality by alleviating attention dilution on long sequences.

Figure 8. Qualitative comparison with other acceleration methods. FastVGGT produces blurred geometry, while TTT3R exhibits noticeable artifacts. ZipMap yields more complete results but still suffers from subtle structural distortions (e.g., misaligned door parts in the red dashed box). Spark3R+VGGT substantially sharpens the reconstruction over FastVGGT, and Spark3R applied to π³ and DA3 further surpasses ZipMap wi…

Figure 9. ATE and wall-clock merging time as a function of the group size…

Figure 10. Merging vs. pruning for key-value tokens. Both strategies use the same temporal stride partitioning into source and destination tokens. Top: ATE as a function of the number of input frames; both achieve nearly identical pose error. Bottom: wall-clock token reduction time. Token merging grows superlinearly due to the bipartite similarity computation, while token pruning remains near zero throughout.
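
Figure 10's cost gap between merging and pruning comes down to whether a pairwise similarity matrix has to be built at all. A rough sketch of the two key-value reduction operators under the same stride partitioning (illustrative only, not the authors' code):

    import torch
    import torch.nn.functional as F

    def kv_prune(x, stride=8):
        # Pruning: keep every stride-th token; no similarity computation.
        return x[:, ::stride]

    def kv_merge_similarities(x, stride=8):
        # Merging first needs a bipartite source/destination similarity
        # matrix before anything can be averaged; its size grows roughly
        # quadratically with sequence length, hence the superlinear
        # wall-clock time reported for token merging.
        dst = x[:, ::stride]
        keep_src = torch.ones(x.shape[1], dtype=torch.bool)
        keep_src[::stride] = False
        src = x[:, keep_src]
        sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).transpose(1, 2)
        return sim  # (batch, n_src, n_dst): the expensive intermediate
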
Original abstract

Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this work, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples the compression of query tokens and key-value tokens by assigning distinct reduction factors, with intra-group token merging applied to query tokens and lightweight token pruning to key-value tokens. Additionally, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework requiring no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, $\pi^3$, and Depth-Anything-3, and achieves up to $28\times$ speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Spark3R, a training-free plug-and-play framework for accelerating feed-forward 3D reconstruction models based on Vision Transformers. It claims that query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. The method decouples their treatment by applying intra-group token merging to queries and lightweight pruning to KV tokens, with adaptive per-layer KV reduction factors, and reports integration into VGGT, π³, and Depth-Anything-3, achieving up to 28× speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.

Significance. If the empirical claims hold, Spark3R provides a practical, training-free method to scale 3D reconstruction to long video sequences by mitigating the quadratic cost of global attention without retraining. The asymmetric reduction strategy based on functional token roles is a targeted optimization that could generalize to other attention-heavy vision models. The plug-and-play integration and high reported speedup are notable strengths that address a real scalability bottleneck in the field.

major comments (2)
  1. [Abstract and §3] The central claim rests on the functional distinction that query tokens are compression-sensitive while KV tokens are tolerant, yet no ablation studies, sensitivity analyses, or quantitative comparisons (e.g., quality drop when applying uniform vs. asymmetric reduction) are referenced to establish this distinction or demonstrate its generalization across the tested models (VGGT, π³, Depth-Anything-3) and long input sequences.
  2. [Evaluation] The abstract asserts up to 28× speedup with maintained competitive quality on 1,000-frame inputs, but provides no specific quantitative metrics (e.g., reconstruction error, PSNR/accuracy scores, runtime breakdowns), baseline comparisons, or error analysis, making it impossible to verify the quality-efficiency trade-off or the adaptive KV reduction's contribution.
minor comments (1)
  1. [Abstract] The model name π³ uses LaTeX rendering that may not render consistently in plain text; ensure uniform notation throughout.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments identify areas where the manuscript would benefit from greater explicitness and additional supporting experiments. We address each point below and commit to revisions that strengthen the presentation of the core claims without altering the technical approach.

Point-by-point responses
  1. Referee: [Abstract and §3] The central claim rests on the functional distinction that query tokens are compression-sensitive while KV tokens are tolerant, yet no ablation studies, sensitivity analyses, or quantitative comparisons (e.g., quality drop when applying uniform vs. asymmetric reduction) are referenced to establish this distinction or demonstrate its generalization across the tested models (VGGT, π³, Depth-Anything-3) and long input sequences.

    Authors: We agree that the manuscript would be stronger with explicit ablations directly comparing uniform versus asymmetric reduction. Section 3 motivates the distinction from the roles of queries (view-specific geometric requests) versus KV tokens (shared scene context) in the attention layers of feed-forward 3D models, but dedicated quantitative validation was not included. In the revision we will add a new ablation subsection (and corresponding table) that reports reconstruction accuracy, PSNR, and error metrics under uniform reduction, our asymmetric strategy, and layer-wise adaptive KV pruning. These experiments will cover all three evaluated models and input lengths up to 1,000 frames, with sensitivity analysis on the reduction factors. revision: yes

  2. Referee: [Evaluation] The abstract asserts up to 28× speedup with maintained competitive quality on 1,000-frame inputs, but provides no specific quantitative metrics (e.g., reconstruction error, PSNR/accuracy scores, runtime breakdowns), baseline comparisons, or error analysis, making it impossible to verify the quality-efficiency trade-off or the adaptive KV reduction's contribution.

    Authors: The current abstract summarizes the headline result, but the referee is correct that specific numbers, runtime breakdowns, and direct baseline comparisons are not referenced from the abstract or evaluation section. The revision will expand the abstract to cite the relevant tables/figures and add a concise summary paragraph in Section 4 that reports concrete metrics (PSNR, depth accuracy, wall-clock time) for the 28× speedup case on 1,000-frame inputs, together with comparisons against uniform token merging and the unaccelerated baselines. We will also include an error analysis of the adaptive KV reduction factor and its contribution to the observed trade-off. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical heuristic with independent experimental validation

Full rationale

The paper's central contribution is an empirical observation about differential compression sensitivity of query vs. KV tokens in pretrained ViT-based 3D reconstructors, followed by a training-free asymmetric reduction scheme (intra-group merging for queries, pruning for KV, with adaptive per-layer factors). No load-bearing step reduces to a self-definition, fitted parameter renamed as prediction, or self-citation chain. The method is presented as a plug-and-play heuristic validated on external models (VGGT, π³, Depth-Anything-3) and long sequences; the functional distinction is not derived from the method itself but tested against it. This matches the default expectation of a non-circular empirical acceleration paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated; the method rests on an empirical observation about token sensitivity that is treated as given for the design of the reduction rules.

pith-pipeline@v0.9.0 · 5578 in / 1069 out tokens · 37543 ms · 2026-05-08T13:36:27.852912+00:00 · methodology

