AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
Pith reviewed 2026-05-10 02:00 UTC · model grok-4.3
The pith
AnyRecon reconstructs 3D scenes from arbitrary unordered sparse views by coupling video diffusion with explicit geometric memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AnyRecon constructs a persistent global scene memory via a prepended capture view cache and removes temporal compression to maintain frame-level correspondence. It couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval, combined with 4-step diffusion distillation and context-window sparse attention for efficiency. This enables robust reconstruction from irregular inputs, large viewpoint gaps, and long trajectories.
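The abstract does not spell out how retrieval consults the geometric memory. As a minimal sketch only, the following assumes the memory stores, per cached capture view, the 3D points observed from it, and ranks views by how much of that geometry projects into the target camera's frustum; every name and interface here is hypothetical rather than AnyRecon's.

```python
# Hypothetical sketch of geometry-driven capture-view retrieval; the
# memory layout and scoring rule are assumptions, not AnyRecon's API.
import numpy as np

class CaptureViewMemory:
    def __init__(self):
        self.view_ids, self.points = [], []

    def add_view(self, view_id, points_world):
        self.view_ids.append(view_id)
        self.points.append(points_world)  # (N, 3) points observed from this view

    def retrieve(self, target_pose_w2c, intrinsics, image_hw, k=4):
        """Rank cached views by the fraction of their 3D points that
        project inside the target camera's image (a frustum-overlap proxy)."""
        h, w = image_hw
        scores = []
        for pts in self.points:
            pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
            cam = (target_pose_w2c @ pts_h.T).T[:, :3]   # world -> target camera
            cam = cam[cam[:, 2] > 1e-6]                  # keep points in front
            uv = (intrinsics @ cam.T).T
            uv = uv[:, :2] / uv[:, 2:3]                  # perspective divide
            inside = ((uv[:, 0] >= 0) & (uv[:, 0] < w) &
                      (uv[:, 1] >= 0) & (uv[:, 1] < h))
            scores.append(inside.sum() / max(len(pts), 1))
        top = np.argsort(scores)[::-1][:k]               # best-overlap views first
        return [self.view_ids[i] for i in top]
```

Conditioning the diffusion model on the returned views is a separate step; the sketch only illustrates that selection is driven by projected geometry rather than capture order.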
What carries the argument
The geometry-aware conditioning strategy that uses an explicit 3D geometric memory and geometry-driven capture-view retrieval to couple generation and reconstruction.
If this is right
- Reconstruction remains consistent even with large viewpoint gaps and long trajectories.
- The system scales to diverse and large-scale 3D scenes beyond what single or dual frame conditioning allows.
- Flexible conditioning on varying numbers of input frames comes without loss of geometric control.
- Processing stays efficient through distilled diffusion steps and sparse attention (a mask sketch follows this list).
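The efficiency items above lean on two mechanisms the abstract names: 4-step diffusion distillation and context-window sparse attention. The exact attention pattern is not given, so the following is only a minimal sketch under one plausible assumption: prepended capture-view tokens stay globally visible while generated frames attend within a local temporal window. All names are illustrative.

```python
# Illustrative context-window sparse attention mask, not the paper's
# exact pattern. Assumption: prepended capture-view tokens are global,
# while generated-frame tokens attend only within a local window.
import numpy as np

def context_window_mask(num_capture, num_frames, window):
    """Boolean (T, T) mask over frame-level tokens, True = may attend."""
    total = num_capture + num_frames
    mask = np.zeros((total, total), dtype=bool)
    mask[:, :num_capture] = True            # everyone reads the capture cache
    mask[:num_capture, :] = True            # cache tokens stay globally connected
    for i in range(num_frames):             # local window among generated frames
        lo = max(0, i - window)
        hi = min(num_frames, i + window + 1)
        mask[num_capture + i, num_capture + lo:num_capture + hi] = True
    return mask

# Dense attention over T frames costs O(T^2); with a width-w window it
# drops to roughly O(T * (w + num_capture)).
m = context_window_mask(num_capture=8, num_frames=64, window=6)
print(m.shape, m.sum(), "of", m.size, "entries kept")
```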
Where Pith is reading between the lines
- This approach could extend traditional reconstruction techniques by incorporating generative capabilities for filling in missing views in sparse captures.
- Applications in robotics or AR might benefit from handling casual, unordered video inputs without requiring structured capture paths.
- Further improvements in memory management could allow even longer sequences or real-time updates.
Load-bearing premise
That the interplay between generation and reconstruction enforced by the explicit 3D geometric memory will maintain consistency across large viewpoint changes without introducing new inconsistencies.
What would settle it
A stress-test reconstruction on a scene with extremely large viewpoint gaps or very long trajectories: visible geometric distortions or inconsistencies relative to ground truth would falsify the consistency claim, while their absence would support it.
Original abstract
Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remain challenging for non-generative reconstruction. Existing diffusion-based approaches mitigates this issues by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond better generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce quadratic complexity. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AnyRecon, a scalable framework for arbitrary-view 3D reconstruction from unordered sparse inputs using a video diffusion model. It constructs a persistent global scene memory via a prepended capture-view cache, removes temporal compression to preserve frame-level correspondence, and introduces geometry-aware conditioning that couples generation and reconstruction through an explicit 3D geometric memory plus geometry-driven capture-view retrieval. Efficiency is achieved via 4-step diffusion distillation and context-window sparse attention. The abstract states that extensive experiments demonstrate robust reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.
Significance. If the consistency and scalability claims hold, the work could meaningfully advance generative 3D reconstruction by supporting flexible conditioning cardinalities and explicit geometric control beyond the one- or two-frame limits of prior diffusion approaches. The explicit memory and retrieval mechanism, combined with distillation for efficiency, represents a potentially useful direction for handling casual captures with large viewpoint variations.
major comments (1)
- [geometry-aware conditioning strategy] The geometry-driven capture-view retrieval (described in the abstract as coupling generation and reconstruction via explicit 3D geometric memory) presupposes sufficiently accurate initial 3D geometry to select relevant views, yet obtaining that geometry is the core reconstruction task from arbitrary sparse inputs. This creates a potential circularity risk for large viewpoint gaps, and the manuscript must clarify the bootstrapping procedure and provide targeted ablations demonstrating that retrieval improves rather than introduces inconsistencies.
minor comments (2)
- [Abstract] Abstract contains grammatical issues: 'mitigates this issues' should read 'mitigate this issue'; 'Beyond better generative model' should read 'Beyond a better generative model'.
- [Abstract] The abstract claims 'extensive experiments demonstrate robust and scalable reconstruction' but does not preview specific quantitative metrics (e.g., PSNR/SSIM on novel-view synthesis, or reconstruction error) or baseline comparisons; these should be summarized so readers can gauge the magnitude of improvement (a minimal metric sketch follows this list).
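For concreteness, the quantitative preview this comment asks for would typically rest on standard image metrics. A minimal sketch using scikit-image's reference implementations follows; nothing in it comes from AnyRecon, and the random images stand in for real rendered/ground-truth pairs.

```python
# Minimal sketch of standard novel-view metrics, using scikit-image's
# reference implementations; unrelated to AnyRecon's own evaluation code.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def novel_view_metrics(rendered, ground_truth):
    """Both images are float arrays in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(ground_truth, rendered, data_range=1.0)
    ssim = structural_similarity(ground_truth, rendered,
                                 channel_axis=-1, data_range=1.0)
    return psnr, ssim

# Toy stand-ins for a rendered/ground-truth pair.
rng = np.random.default_rng(0)
gt = rng.random((64, 64, 3))
noisy = np.clip(gt + 0.05 * rng.standard_normal(gt.shape), 0, 1)
print(novel_view_metrics(noisy, gt))
```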
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for identifying a key point that requires clarification in our manuscript. We address the major comment below and commit to revisions that strengthen the presentation of the geometry-aware conditioning strategy.
Point-by-point responses
Referee: [geometry-aware conditioning strategy] The geometry-driven capture-view retrieval (described in the abstract as coupling generation and reconstruction via explicit 3D geometric memory) presupposes sufficiently accurate initial 3D geometry to select relevant views, yet obtaining that geometry is the core reconstruction task from arbitrary sparse inputs. This creates a potential circularity risk for large viewpoint gaps, and the manuscript must clarify the bootstrapping procedure and provide targeted ablations demonstrating that retrieval improves rather than introduces inconsistencies.
Authors: We appreciate this observation on potential circularity. In AnyRecon the explicit 3D geometric memory is initialized from the sparse unordered input views themselves by first running a lightweight coarse reconstruction step (pose estimation and initial point cloud via a standard SfM pipeline such as COLMAP applied directly to the provided frames). This coarse geometry, while not final, is sufficient to drive the initial retrieval of relevant cached capture views for conditioning the diffusion model. As novel views are generated, the geometric memory is updated with the new 3D information, enabling progressive refinement and retrieval in subsequent steps. The process is therefore iterative rather than presupposing a complete, accurate geometry upfront, which directly addresses large viewpoint gaps by bootstrapping from whatever inputs are available. We acknowledge that the current manuscript description of this bootstrapping procedure is insufficiently explicit and will revise the method section to provide a clear step-by-step account, including pseudocode for the initialization and update loop. In addition, we will add targeted ablation experiments that isolate the retrieval component (with vs. without geometry-driven selection) on sequences exhibiting large viewpoint changes, reporting metrics for geometric consistency and artifact reduction to demonstrate that retrieval improves rather than harms reconstruction quality.
Revision: yes
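Pending the promised pseudocode, the loop described above can be sketched schematically. Every class and function below is a stub standing in for unspecified machinery (the SfM initializer, the distilled diffusion model, the memory update rule); none of these names are AnyRecon's actual interface.

```python
# Schematic of the rebuttal's bootstrap-and-refine loop. All components
# are hypothetical stubs, not AnyRecon's API.

def coarse_sfm(views):
    # Stand-in for a standard SfM pipeline (e.g., COLMAP) run on the
    # sparse inputs alone: rough camera poses plus a sparse point cloud.
    return [f"pose_{i}" for i in range(len(views))], "sparse_point_cloud"

class SceneMemory:
    def __init__(self, views, poses, points):
        self.views, self.poses, self.points = list(views), list(poses), points

    def retrieve(self, target_pose, k=4):
        # Stand-in for geometry-driven retrieval: choose cached capture
        # views whose observed geometry overlaps the target camera.
        return self.views[:k]

    def update(self, frame, pose):
        # Fold the generated frame back into the memory so later
        # retrieval sees progressively refined geometry (the step that
        # breaks the apparent circularity).
        self.views.append(frame)
        self.poses.append(pose)

def diffusion_generate(cond_views, target_pose, steps=4):
    # Stand-in for the 4-step distilled video diffusion model.
    return f"frame_at_{target_pose}"

def reconstruct(sparse_views, trajectory):
    poses, points = coarse_sfm(sparse_views)            # 1. coarse bootstrap
    memory = SceneMemory(sparse_views, poses, points)
    for target_pose in trajectory:
        cond = memory.retrieve(target_pose)             # 2. retrieval
        frame = diffusion_generate(cond, target_pose)   # 3. generation
        memory.update(frame, target_pose)               # 4. memory refresh
    return memory

print(len(reconstruct(["v0", "v1", "v2"], ["p0", "p1"]).views))
```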
Circularity Check
No significant circularity: the framework is explicitly constructed without self-referential reductions.
Full rationale
The paper proposes AnyRecon as a new scalable framework for arbitrary-view 3D reconstruction, describing explicit components such as a prepended capture view cache for persistent global scene memory, removal of temporal compression to maintain frame-level correspondence, and a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory plus geometry-driven retrieval. No equations, derivations, fitted parameters presented as predictions, or self-citations that serve as load-bearing justifications appear in the provided text. The central claims rest on the construction of these architectural choices to address limitations in prior diffusion-based methods, without reducing to self-definitional loops, renamed empirical patterns, or ansatzes smuggled via prior self-work. The derivation chain is therefore self-contained as a proposed engineering solution.
Reference graph
Works this paper leans on
- [1] Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: ReCamMaster: Camera-controlled generative rendering from a single video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14834–14844 (2025)
- [2] Cao, C., Zhou, J., Li, S., Liang, J., Yu, C., Wang, F., Xue, X., Fu, Y.: Uni3C: Unifying precisely 3D-enhanced camera and human motion controls for video generation. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pp. 1–12 (2025)
- [3] Chen, S., Wei, C., Sun, S., Nie, P., Zhou, K., Zhang, G., Yang, M.H., Chen, W.: Context forcing: Consistent autoregressive video generation with long context. arXiv preprint arXiv:2602.06028 (2026)
- [4] Chen, W., Bi, J., Huang, Y., Zheng, W., Duan, Y.: SceneCompleter: Dense 3D scene completion for generative novel view synthesis. arXiv preprint arXiv:2506.10981 (2025)
- [5] Deng, K., Liu, A., Zhu, J.Y., Ramanan, D.: Depth-supervised NeRF: Fewer views and faster training for free. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
- [6] Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P.P., Barron, J.T., Poole, B.: CAT3D: Create anything in 3D with multi-view diffusion models. Advances in Neural Information Processing Systems (2024)
- [7] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=nZeVKeeFYf9
- [8] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4), Article 139 (2023)
- [9] Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and Temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics 36(4) (2017)
- [10] Li, R., Torr, P., Vedaldi, A., Jakab, T.: VMem: Consistent interactive video scene generation with surfel-indexed view memory. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 25690–25699 (2025)
- [11] Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22160–22169 (2024)
- [12] Liu, J., Li, J., Deng, J., Li, G., Zhou, S., Fang, Z., Lao, S., Deng, Z., Zhu, J., Ma, T., et al.: Dreamontage: Arbitrary frame-guided one-shot video generation. arXiv preprint arXiv:2512.21252 (2025)
- [13] Liu, X., Chen, J., Kao, S.H., Tai, Y.W., Tang, C.K.: Deceptive-NeRF: Enhancing NeRF reconstruction using pseudo-observations from diffusion models (2023)
- [14] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)
- [15] Niemeyer, M., Barron, J.T., Mildenhall, B., Sajjadi, M.S., Geiger, A., Radwan, N.: RegNeRF: Regularizing neural radiance fields for view synthesis from sparse inputs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5480–5490 (2022)
- [16] Ren, X., Shen, T., Huang, J., Ling, H., Lu, Y., Nimier-David, M., Müller, T., Keller, A., Fidler, S., Gao, J.: GEN3C: 3D-informed world-consistent video generation with precise camera control. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6121–6132 (2025)
- [17] Truong, P., Rakotosaona, M.J., Manhardt, F., Tombari, F.: SPARF: Neural radiance fields from sparse and noisy poses. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
- [18] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint (2025)
- [19] Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: VGGT: Visual geometry grounded transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5294–5306 (2025)
- [20] Wang, Q., Zhang, Y., Holynski, A., Efros, A.A., Kanazawa, A.: Continuous 3D perception model with persistent state. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10510–10522 (2025)
- [21] Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: DUSt3R: Geometric 3D vision made easy. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20697–20709 (2024)
- [22] Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.: π³: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347 (2025)
- [23] Wu, J.Z., Zhang, Y., Turki, H., Ren, X., Gao, J., Shou, M.Z., Fidler, S., Gojcic, Z., Ling, H.: Difix3D+: Improving 3D reconstructions with single-step diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26024–26035 (2025)
- [24] Wu, R., Mildenhall, B., Henzler, P., Park, K., Gao, R., Watson, D., Srinivasan, P.P., Verbin, D., Barron, J.T., Poole, B., et al.: ReconFusion: 3D reconstruction with diffusion priors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21551–21561 (2024)
- [25] Yang, J., Pavone, M., Wang, Y.: FreeNeRF: Improving few-shot neural rendering with free frequency regularization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8254–8263 (2023)
- [26] Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T.: Improved distribution matching distillation for fast image synthesis. In: NeurIPS (2024)
- [27] Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distribution matching distillation. In: CVPR (2024)
- [28] Yu, J., Bai, J., Qin, Y., Liu, Q., Wang, X., Wan, P., Zhang, D., Liu, X.: Context as memory: Scene-consistent interactive long video generation with memory retrieval. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pp. 1–11 (2025)
- [29] Yu, W., Xing, J., Yuan, L., Hu, W., Li, X., Huang, Z., Gao, X., Wong, T.T., Shan, Y., Tian, Y.: ViewCrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048 (2024)
- [30] Yu, Z., Peng, S., Niemeyer, M., Sattler, T., Geiger, A.: MonoSDF: Exploring monocular geometric cues for neural implicit surface reconstruction. Advances in Neural Information Processing Systems 35, 25018–25032 (2022)