arxiv: 2605.19786 · v1 · pith:I47U6W62new · submitted 2026-05-19 · 💻 cs.CV

Fast 4D Mesh Generation by Spatio-Temporal Attention Chains

Pith reviewed 2026-05-20 06:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D mesh generation · spatio-temporal attention · dynamic 3D reconstruction · video to mesh · temporal correspondence · attention chain · zero-shot tracking · camera estimation

The pith

Spatio-Temporal Attention Chains generate accurate 4D meshes from video in 9 seconds by propagating latent correspondences without explicit matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a training-free framework that accelerates 4D mesh generation from videos while improving temporal consistency. Its central observation is that reliable temporal correspondences appear inside a 4D backbone's latent representations well before the output meshes become visually accurate. The method uses an attention chain to start from vertices on an anchor mesh, map them to latent tokens, follow temporal links across frames in latent space, and recover per-frame vertices via attention. This avoids slow explicit matching steps and preserves geometric details. The result is faster generation that scales to longer sequences and supports downstream tasks such as tracking and camera estimation.

Core claim

The central claim is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. The Spatio-Temporal Attention Chain exploits this by mapping vertices from an anchor mesh to latent tokens, following temporal correspondences in latent space, and recovering frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor mesh details, thereby improving dynamic mesh geometry and temporal consistency.

What carries the argument

Spatio-Temporal Attention Chain, which propagates vertex information across space and time by mapping anchor vertices to latent tokens, tracking correspondences in latent space, and recovering per-frame vertices via attention.
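The three hops named above can be sketched with plain matrices. This is a minimal illustration, not the paper's implementation: the function name `attention_chain`, the argument names, the array shapes, and the choice to read out target positions as an attention-weighted average of candidate points are all assumptions made for the example.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax, so each row becomes a probability distribution."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)


def attention_chain(vertex_to_latent, latent_to_latent, latent_to_vertex, target_points):
    """Compose three attention hops and read out one 3D position per anchor vertex.

    All shapes are hypothetical:
      vertex_to_latent : (V, L) anchor vertices -> anchor-frame latent tokens
      latent_to_latent : (L, L) anchor-frame tokens -> target-frame tokens
      latent_to_vertex : (L, M) target-frame tokens -> candidate target points
      target_points    : (M, 3) candidate 3D positions in the target frame
    """
    chain = vertex_to_latent @ latent_to_latent @ latent_to_vertex  # (V, M)
    # Each anchor vertex lands on a weighted average of target-frame points.
    return chain @ target_points, chain


# Random stand-ins for attention maps that a real backbone would provide.
rng = np.random.default_rng(0)
V, L, M = 5, 8, 6
a1 = softmax(rng.normal(size=(V, L)))
a2 = softmax(rng.normal(size=(L, L)))
a3 = softmax(rng.normal(size=(L, M)))
points = rng.normal(size=(M, 3))
positions, chain = attention_chain(a1, a2, a3, points)
print(positions.shape)  # (5, 3)
```

When every hop is one-hot, the chain reduces to a hard vertex-to-vertex lookup; with soft attention it blends nearby candidates, which is one plausible reading of how no explicit matching step is needed.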

If this is right

  • Generates a 4D mesh in 9 seconds, a 13× speedup over prior methods.
  • Processes videos up to 16× longer without degrading mesh quality.
  • Delivers competitive zero-shot performance on 2D object tracking and 4D tracking tasks.
  • Enables reliable camera estimation, a capability absent from previous 4D mesh generators.
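To make the camera-estimation item in the list above concrete, here is a hedged sketch of how a camera can be recovered once 3D mesh vertices are matched to 2D image locations. It uses the textbook Direct Linear Transform (DLT), which is a swapped-in technique chosen for brevity; the summary does not state which solver the paper uses, and the function names and the synthetic camera below are assumptions.

```python
import numpy as np


def estimate_projection_dlt(points_3d, points_2d):
    """Estimate a 3x4 projection matrix (up to scale) from N >= 6 point pairs."""
    pts3 = np.asarray(points_3d, dtype=float)
    pts2 = np.asarray(points_2d, dtype=float)
    if pts3.shape[0] < 6 or pts3.shape[0] != pts2.shape[0]:
        raise ValueError("need at least 6 matching 3D/2D points")
    homog = np.hstack([pts3, np.ones((pts3.shape[0], 1))])  # (N, 4)
    zeros = np.zeros(4)
    rows = []
    for x_h, (u, v) in zip(homog, pts2):
        # Each pair gives two linear constraints on the 12 entries of P.
        rows.append(np.concatenate([x_h, zeros, -u * x_h]))
        rows.append(np.concatenate([zeros, x_h, -v * x_h]))
    # The right singular vector of the smallest singular value solves A p = 0.
    _, _, vt = np.linalg.svd(np.stack(rows))
    return vt[-1].reshape(3, 4)


def project(projection, points_3d):
    """Apply a 3x4 projection matrix and divide by depth to get pixel coordinates."""
    pts3 = np.asarray(points_3d, dtype=float)
    homog = np.hstack([pts3, np.ones((pts3.shape[0], 1))])
    image = homog @ projection.T
    return image[:, :2] / image[:, 2:3]


# Synthetic check with a made-up camera (hypothetical intrinsics and pose).
rng = np.random.default_rng(0)
intrinsics = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
extrinsics = np.hstack([np.eye(3), np.array([[0.0], [0.0], [5.0]])])
true_projection = intrinsics @ extrinsics
vertices = rng.uniform(-1.0, 1.0, size=(20, 3))
pixels = project(true_projection, vertices)
estimated = estimate_projection_dlt(vertices, pixels)
max_error = np.abs(project(estimated, vertices) - pixels).max()
```

With noise-free correspondences the reprojection error should be close to zero. Real correspondences are noisy, so a robust wrapper such as RANSAC around a PnP solver would be the usual choice in practice.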

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reliance on early latent structure suggests the chain could be applied to other video-based 3D generative models where internal representations form before final outputs refine.
  • Speedups of this magnitude could open real-time dynamic reconstruction uses in robotics or augmented reality that current methods cannot support.
  • The separation of correspondence finding from mesh refinement points to a broader pattern where latent-space propagation replaces explicit feature matching in video tasks.

Load-bearing premise

Temporal correspondences become reliable in the latent space of the 4D backbone before the generated meshes achieve visual accuracy.

What would settle it

Extract attention-based correspondences from early latent states of the 4D backbone on videos with ground-truth tracks and check whether their accuracy matches or exceeds that of correspondences from final visually accurate meshes.
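The test described above boils down to comparing two sets of predicted tracks against ground truth. The sketch below shows one way to score that comparison; the function names, the array layout (one row per tracked point), and the pass criterion are assumptions for illustration, not a protocol taken from the paper.

```python
import numpy as np


def endpoint_error(predicted, ground_truth):
    """Mean Euclidean distance between predicted and ground-truth track points."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return float(np.linalg.norm(predicted - ground_truth, axis=-1).mean())


def fraction_within(predicted, ground_truth, threshold):
    """Share of points whose error is at most `threshold` (same units as the inputs)."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    errors = np.linalg.norm(predicted - ground_truth, axis=-1)
    return float((errors <= threshold).mean())


def early_is_as_good(early_tracks, final_tracks, ground_truth, tolerance=0.0):
    """True if early-latent tracks are at least as accurate as final-mesh tracks."""
    early = endpoint_error(early_tracks, ground_truth)
    final = endpoint_error(final_tracks, ground_truth)
    return early <= final + tolerance
```

A result where `early_is_as_good` holds across many videos would support the premise; a consistent failure would count against it.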

Figures

Figures reproduced from arXiv: 2605.19786 by Dvir Samuel, Gal Chechik, Yoni Kasten, Yuval Atzmon.

Figure 1: Method overview. Our attention chain follows a point through the frozen 4D generator: from an anchor mesh vertex to latent tokens, across time to target-frame tokens, and back to a target mesh vertex. Image-patch endpoints give analogous chains for 2D tracking, camera pose estimation, and 4D tracking, without additional training.
Figure 2: Long-sequence rollout. (a) Naive autoregressive 4D generation accumulates errors over time, degrading mesh quality. We find that this drift is driven by weakening latent correspondences across windows. (b) Our correspondence reinforcement preserves these correlations, stabilizing the rollout and maintaining high-quality generation.
Figure 3: 4D Mesh Generation. Our method produces sharp, temporally consistent meshes, aligns them to the input camera, and runs in only 9 sec, compared to 120–900 sec for prior methods. Unlike SG4D and ActionMesh, which generate object-centric meshes, our spatial attention-chain correspondences enable camera recovery and world placement, yielding high foreground overlap (yellow) and fewer mismatch regions (red/gree…
Figure 4: Video-to-4D Scene Alignment. Given our generated 4D object and a reconstructed background scene, we align the object to the environment using our dense 2D-to-3D correspondences. While this example utilizes Depth Anything V3 [45], our alignment framework is completely agnostic and functions seamlessly with any underlying scene reconstruction method. Please refer to our supplementary website for interactive,…
Figure 5: Additional 4D mesh generation results. Side-by-side comparison of ActionMesh [59] (left of each pair) and our method (right) on diverse ActionBench sequences. Each row shows a different input video. Our method produces sharper, temporally consistent meshes with fewer distortions while requiring only 9 s per clip vs. 120 s for ActionMesh.
Figure 6: Long-sequence autoregressive generation. Mesh quality over 240 frames (sampled at frames 1, 80, 160, 240). ActionMesh progressively degrades due to accumulated drift across autoregressive windows, eventually losing recognizable geometry. Our correspondence reinforcement maintains stable, high-quality meshes throughout the entire sequence.
Figure 7: Zero-shot 2D point tracking. Comparison with Denoise-to-Track [92] on challenging sequences with articulated motion. Colored dots mark tracked query points; trails show predicted trajectories. Our attention-chain correspondences yield geometrically grounded tracks that more faithfully follow the underlying surface motion, particularly on limbs and fine structures.
Figure 8: CD-3D / CD-4D / CD-Motion vs. Stage I denoising steps on ActionBench. Ours plateaus …
Original abstract

4D mesh generation has recently emerged as a powerful paradigm for recovering dynamic 3D structure from videos, but existing methods remain slow, computationally expensive, and difficult to scale to longer sequences. We introduce a training-free approach that accelerates 4D mesh generation while improving temporal correspondence quality. Our key observation is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. We exploit this with a general framework we call Spatio-Temporal Attention Chain which propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens. It then follows temporal correspondences in latent space, and recovers frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor mesh details and thereby improving dynamic mesh geometry and temporal consistency. Compared to state-of-the-art, our method generates a 4D mesh in 9 seconds, achieving a $13\times$ speedup while producing higher-quality results. Moreover, our approach scales to videos up to $16\times$ longer without degrading mesh quality. Beyond generation, the improved correspondences enable competitive zero-shot performance on two downstream tasks: 2D object tracking and 4D tracking. We further show that our framework enables reliable camera estimation, a capability not supported by prior 4D mesh generation methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces a training-free Spatio-Temporal Attention Chain for fast 4D mesh generation from videos. It exploits the observation that temporal correspondences emerge in the latent space of a 4D backbone well before output meshes achieve visual accuracy. The chain maps anchor-mesh vertices to latent tokens, follows temporal correspondences in latent space, and recovers per-frame vertices via latent-to-vertex attention. Claims include generating a 4D mesh in 9 seconds (13× speedup over SOTA), higher quality, scalability to 16× longer videos without quality loss, and competitive zero-shot results on 2D/4D tracking plus camera estimation.

Significance. If the timing observation is validated and the reported speed/quality gains hold under rigorous testing, the work offers a practical route to scalable 4D reconstruction without retraining. The training-free design and claimed scaling behavior are clear strengths that could broaden applicability in video-based dynamic 3D tasks.

major comments (1)
  1. [Abstract / Key Observation] The Spatio-Temporal Attention Chain design rests on the claim that reliable temporal correspondences appear in the 4D backbone's latent space long before the generated meshes become geometrically accurate. No layer-wise or depth-wise measurements (e.g., correspondence endpoint error versus Chamfer distance or normal consistency across successive layers) are supplied to confirm this timing. Without such evidence the chain risks propagating noisy early correspondences, which would undermine both the claimed speedup and the quality improvements.
minor comments (2)
  1. [Abstract] The abstract asserts a 13× speedup and higher-quality results relative to state-of-the-art but does not name the specific baseline methods, datasets, or evaluation metrics used for these comparisons.
  2. [Abstract] Implementation details such as the exact 4D backbone architecture, the number of layers traversed by the attention chain, and how anchor vertices are chosen would clarify reproducibility.
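The major comment asks for correspondence error to be plotted against geometric accuracy across successive layers or steps. A minimal sketch of the geometric half of that measurement is given here. It assumes the symmetric, non-squared form of Chamfer distance (conventions differ between papers), and `first_step_within` is a hypothetical helper for locating where a per-step metric curve settles.

```python
import numpy as np


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean() + dists.min(axis=0).mean())


def first_step_within(values, tolerance):
    """Index of the first step whose metric is within `tolerance` of the final value."""
    values = np.asarray(values, dtype=float)
    return int(np.argmax(values <= values[-1] + tolerance))
```

If the correspondence-error curve reaches its plateau at an earlier index than the Chamfer curve does, that would be direct evidence for the timing claim the referee wants verified.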

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and commit to revisions that will strengthen the validation of our key observation.

Point-by-point responses
  1. Referee: [Abstract / Key Observation] The Spatio-Temporal Attention Chain design rests on the claim that reliable temporal correspondences appear in the 4D backbone's latent space long before the generated meshes become geometrically accurate. No layer-wise or depth-wise measurements (e.g., correspondence endpoint error versus Chamfer distance or normal consistency across successive layers) are supplied to confirm this timing. Without such evidence the chain risks propagating noisy early correspondences, which would undermine both the claimed speedup and the quality improvements.

    Authors: We agree that explicit layer-wise or depth-wise measurements would provide stronger direct support for the timing observation. The manuscript currently motivates the Spatio-Temporal Attention Chain from this observation and validates the overall approach through end-to-end speed, quality, and scaling results, but does not include the requested per-layer correspondence endpoint error versus geometric accuracy comparisons. To address this, we will add the suggested analysis in the revised manuscript, including quantitative plots of correspondence accuracy at successive layers of the 4D backbone alongside Chamfer distance and normal consistency metrics. This will confirm that reliable correspondences emerge early while geometric refinement occurs later.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on empirical observation from external backbones

full rationale

The paper introduces a training-free Spatio-Temporal Attention Chain framework justified by the stated observation that temporal correspondences appear early in 4D backbones. No equations, fitted parameters, or self-citations are shown to reduce the central claims to inputs by construction. The speedup, scaling, and downstream task results are presented as empirical outcomes rather than tautological redefinitions. The method description maps vertices to latents and follows correspondences without invoking self-referential definitions or renaming known results as novel derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests primarily on a domain assumption about attention behavior in 4D backbones rather than new free parameters or invented entities.

axioms (1)
  • Domain assumption: Temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate.
    This observation is invoked to justify propagating information via the attention chain without explicit matching.

pith-pipeline@v0.9.0 · 5776 in / 1195 out tokens · 34892 ms · 2026-05-20T06:47:22.023128+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction

    Relation between the paper passage and the cited Recognition theorem: unclear

    We exploit this with a general framework we call Spatio-Temporal Attention Chain which propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens. It then follows temporal correspondences in latent space, and recovers frame-specific vertices through latent-to-vertex attention.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

    Relation between the paper passage and the cited Recognition theorem: unclear

    each attention row is a probability distribution over latent tokens, so multiplying attention maps gives the probability of moving from one representation to the next
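The probabilistic reading in the quoted passage can be checked numerically: if every row of each attention map sums to one, so does every row of their product, and the product entry (i, j) is the probability of reaching j from i through all intermediate tokens. The sketch below uses small made-up matrices; `is_row_stochastic` and `compose` are illustrative names, not functions from the paper.

```python
from functools import reduce

import numpy as np


def is_row_stochastic(matrix, tol=1e-9):
    """True if all entries are non-negative and every row sums to one."""
    matrix = np.asarray(matrix, dtype=float)
    return bool(np.all(matrix >= -tol) and np.allclose(matrix.sum(axis=1), 1.0, atol=tol))


def compose(*attention_maps):
    """Multiply attention maps left to right, like steps of a Markov chain."""
    return reduce(np.matmul, [np.asarray(m, dtype=float) for m in attention_maps])


first_hop = np.array([[0.5, 0.5], [0.0, 1.0]])
second_hop = np.array([[1.0, 0.0], [0.5, 0.5]])
two_hop = compose(first_hop, second_hop)
# Rows of two_hop are [0.75, 0.25] and [0.5, 0.5], and each still sums to one.
```

Because the product stays row-stochastic, any number of hops can be chained without renormalizing.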

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

101 extracted references · 101 canonical work pages · 7 internal anchors

  [1] S. Abnar and W. Zuidema. Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197. Association for Computational Linguistics, July 2020.

  [2] Y. Atzmon, R. Gal, Y. Tewel, Y. Kasten, and G. Chechik. Identity-motion trade-offs in text-to-video generation. In 36th British Machine Vision Conference 2025, BMVC, 2025.

  [3] B. Biggs, T. Roddick, A. Fitzgibbon, and R. Cipolla. Creatures great and SMAL: Recovering the shape and motion of animals from video. In ACCV, 2018.

  [4] R. C. Bolles and M. A. Fischler. A ransac-based approach to model fitting and its application to finding cylinders in range data. In IJCAI, volume 1981, pages 637–643, 1981.

  [5] M. Cao, X. Wang, Z. Qi, Y. Shan, X. Qie, and Y. Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF international conference on computer vision, pages 22560–22570, 2023.

  [6] W. Cao, C. Luo, B. Zhang, M. Nießner, and J. Tang. Motion2vecsets: 4d latent vector set diffusion for non-rigid shape reconstruction and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20496–20506, 2024.

  [7] H. Chefer, S. Gur, and L. Wolf. Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 397–406, October 2021.

  [8] J. Chen, B. Zhang, X. Tang, and P. Wonka. V2m4: 4d mesh animation reconstruction from a single monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11643–11653, 2025.

  [9] R. Chen, J. Zhang, Y. Liang, G. Luo, W. Li, J. Liu, X. Li, X. Long, J. Feng, and P. Tan. Dora: Sampling and benchmarking for 3d shape variational auto-encoders. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 16251–16261, June 2025.

  [10] X. Chen, Y. Chen, Y. Xiu, A. Geiger, and A. Chen. Easi3r: Estimating disentangled motion from dust3r without training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9158–9168, 2025.

  [11] S. Cho, J. Huang, J. Nam, H. An, S. Kim, and J.-Y. Lee. Local all-pair correspondence for point tracking. In European conference on computer vision, pages 306–325. Springer, 2024.

  [12] C. Doersch, A. Gupta, L. Markeeva, A. Recasens, L. Smaira, Y. Aytar, J. Carreira, A. Zisserman, and Y. Yang. Tap-vid: A benchmark for tracking any point in a video. Advances in Neural Information Processing Systems, 35:13610–13626, 2022.

  [13] C. Doersch, P. Luc, Y. Yang, D. Gokay, S. Koppula, A. Gupta, J. Heyward, I. Rocco, R. Goroshin, J. Carreira, et al. Bootstap: Bootstrapped training for tracking-any-point. In Proceedings of the Asian Conference on Computer Vision, pages 3257–3274, 2024.

  [14] C. Doersch, Y. Yang, M. Vecerik, D. Gokay, A. Gupta, Y. Aytar, J. Carreira, and A. Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10061–10072, 2023.

  [15] N. S. Dutt, S. Muralikrishnan, and N. J. Mitra. Diffusion 3d features (diff3f): Decorating untextured shapes with distilled semantic features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4494–4504, June 2024.

  [16] Y. Erel, O. Dünkel, R. Dabral, V. Golyanik, C. Theobalt, and A. H. Bermano. Attention (as discrete-time markov) chains. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

  [17] H. Feng, J. Zhang, Q. Wang, Y. Ye, P. Yu, M. J. Black, T. Darrell, and A. Kanazawa. St4rtrack: Simultaneous 4d reconstruction and tracking in the world. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8503–8513, 2025.

  [18] S. Fu, N. Tamir, S. Sundaram, L. Chai, R. Zhang, T. Dekel, and P. Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. In Advances in Neural Information Processing Systems, volume 36, pages 50742–50768, 2023.

  [19] Z. Guo, J. Xiang, K. Ma, W. Zhou, H. Li, and R. Zhang. Make-it-animatable: An efficient framework for authoring animation-ready 3d characters. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10783–10792, 2025.

  [20] A. W. Harley, Z. Fang, and K. Fragkiadaki. Particle video revisited: Tracking through occlusions using point trajectories. In European Conference on Computer Vision, pages 59–75. Springer, 2022.

  [21] A. W. Harley, Y. You, X. Sun, Y. Zheng, N. Raghuraman, Y. Gu, S. Liang, W.-H. Chu, A. Dave, S. You, et al. Alltracker: Efficient dense point tracking at high resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5253–5262, 2025.

  [22] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.

  [23] Y. Hong, K. Zhang, J. Gu, S. Bi, Y. Zhou, D. Liu, F. Liu, K. Sunkavalli, T. Bui, and H. Tan. Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400, 2023.

  [24] H. Jeong, C.-H. P. Huang, J. C. Ye, N. J. Mitra, and D. Ceylan. Track4gen: Teaching video diffusion models to track points improves video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  [25] Y. Jiang, C. Yu, C. Cao, F. Wang, W. Hu, and J. Gao. Animate3d: Animating any 3d model with multi-view video diffusion. Advances in Neural Information Processing Systems, 37:125879–125906, 2024.

  [26] Y. Jiang, L. Zhang, J. Gao, W. Hu, and Y. Yao. Consistent4d: Consistent 360° dynamic object generation from monocular video. In The Twelfth International Conference on Learning Representations, 2024.

  [27] Z. Jiang, C. Zheng, I. Laina, D. Larlus, and A. Vedaldi. Mesh4d: 4d mesh reconstruction and tracking from monocular video. arXiv preprint arXiv:2601.05251, 2026.

  [28] L. Jin, R. Tucker, Z. Li, D. Fouhey, N. Snavely, and A. Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10497–10509, 2025.

  [29] N. Karaev, Y. Makarov, J. Wang, N. Neverova, A. Vedaldi, and C. Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6013–6022, 2025.

  [30] N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rupprecht. Dynamicstereo: Consistent dynamic depth from stereo videos. CVPR, 2023.

  [31] N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rupprecht. Cotracker: It is better to track together. In European conference on computer vision, pages 18–35. Springer, 2024.

  [32] J. Karhade, N. Keetha, Y. Zhang, T. Gupta, A. Sharma, S. Scherer, and D. Ramanan. Any4d: Unified feed-forward metric 4d reconstruction. arXiv preprint arXiv:2512.10935, 2025.

  [33] Y. Kasten, W. Lu, and H. Maron. Fast encoder-based 3d from casual videos via point track processing. Advances in Neural Information Processing Systems, 37:96150–96180, 2024.

  [34] M. Kwon, J. Choi, J. Park, S. Jeon, J. Jang, J. Seo, M.-S. Kwak, J.-H. Kim, and S. Kim. Cameo: Correspondence-attention alignment for multi-view diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

  [35] Z. Lai, E. Insafutdinov, E. Sucar, and A. Vedaldi. Cowtracker: Tracking by warping instead of correlation. arXiv preprint arXiv:2602.04877, 2026.

  [36] G. Le Moing, J. Ponce, and C. Schmid. Dense optical tracking: Connecting the dots. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19187–19197, 2024.

  [37] V. Lepetit, F. Moreno-Noguer, and P. Fua. Epnp: Efficient perspective-n-point camera pose estimation. Int. J. Comput. Vis, 81(2):155–166, 2009.

  [38] W. Li, J. Liu, H. Yan, R. Chen, Y. Liang, X. Chen, P. Tan, and X. Long. Craftsman3d: High-fidelity mesh generation with 3d native diffusion and interactive geometry refiner. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5307–5317, 2025.

  [39] W. Li, X. Zhang, Z. Sun, D. Qi, H. Li, W. Cheng, W. Cai, S. Wu, J. Liu, Z. Wang, et al. Step1x-3d: Towards high-fidelity and controllable generation of textured 3d assets. arXiv preprint arXiv:2505.07747, 2025.

  [40] X. Li, F. Zhang, J. Pan, Y. Hou, V. Y. F. Tan, and Z. Yang. Enhancing long video generation consistency without tuning. arXiv preprint arXiv:2412.17254, 2024. ICML 2025 Workshop on Building Physically Plausible World Models.

  [41] Y. Li, Z.-X. Zou, Z. Liu, D. Wang, Y. Liang, Z. Yu, X. Liu, Y.-C. Guo, D. Liang, W. Ouyang, et al. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models. arXiv preprint arXiv:2502.06608, 2025.

  [42] Z. Li, Y. Chen, and P. Liu. Dreammesh4d: Video-to-4d generation with sparse-controlled gaussian-mesh hybrid representation. Advances in Neural Information Processing Systems, 37:21377–21400, 2024.

  [43] Z. Li, R. Tucker, F. Cole, Q. Wang, L. Jin, V. Ye, A. Kanazawa, A. Holynski, and N. Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10486–10496, 2025.

  [44] H. Liang, Y. Yin, D. Xu, H. Liang, Z. Wang, K. N. Plataniotis, Y. Zhao, and Y. Wei. Diffusion4d: Fast spatial-temporal consistent 4d generation via video diffusion models. arXiv preprint arXiv:2405.16645, 2024.

  [45] H. Lin, S. Chen, J. H. Liew, D. Y. Chen, Z. Li, G. Shi, J. Feng, and B. Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025.

  [46] I. Liu, Z. Xu, Y. Wang, H. Tan, Z. Xu, X. Wang, H. Su, and Z. Shi. Riganything: Template-free autoregressive rigging for diverse 3d assets. ACM Transactions on Graphics (TOG), 44(4):1–12, 2025.

  [47] X. Liu, Y. Xiao, D. Y. Chen, J. Feng, Y.-W. Tai, C.-K. Tang, and B. Kang. Trace anything: Representing any video in 4d via trajectory fields. arXiv preprint arXiv:2510.13802, 2025.

  [48] R. Lu, Y. Chen, Y. Liu, J. Tang, J. Ni, D. Wan, G. Zeng, and S. Huang. Taco: Taming diffusion for in-the-wild video amodal completion. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.

  [49] Y. Luo, S. Zhou, Y. Lan, X. Pan, and C. C. Loy. 4rc: 4d reconstruction via conditional querying anytime and anywhere. arXiv preprint arXiv:2602.10094, 2026.

  [50] S. Muralikrishnan, N. S. Dutt, and N. J. Mitra. Smf: Template-free and rig-free animation transfer using kinetic codes. ACM Transactions on Graphics (TOG), 44(6), 2025.

  [51] J. Nam, S. Son, D. Chung, J. Kim, S. Jin, J. Hur, and S. Kim. Emergent temporal correspondences from video diffusion transformers. In Advances in Neural Information Processing Systems (NeurIPS), 2025.

  [52] M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. B. Huang, S.-W. Li, I. Misra, M. G. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. Dinov2: Learning robust visual features w...

  [53] A. Pondaven, A. Siarohin, S. Tulyakov, P. Torr, and F. Pizzati. Video motion transfer with diffusion transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  [54] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.

  [55] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.

  [56] J. Ren, L. Pan, J. Tang, C. Zhang, A. Cao, G. Zeng, and Z. Liu. Dreamgaussian4d: Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142, 2023.

  [57] J. Ren, K. Xie, A. Mirzaei, H. Liang, X. Zeng, K. Kreis, Z. Liu, A. Torralba, S. Fidler, S. W. Kim, et al. L4gm: Large 4d gaussian reconstruction model. Advances in Neural Information Processing Systems, 37:56828–56858, 2024.

  [58] R. Sabathier, N. J. Mitra, and D. Novotny. Lim: Large interpolator model for dynamic reconstruction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6154–6164, 2025.

  [59] R. Sabathier, D. Novotny, N. J. Mitra, and T. Monnier. Actionmesh: Animated 3d mesh generation with temporal 3d diffusion. arXiv preprint arXiv:2601.16148, 2026.

  [60] D. Samuel, M. Levy, N. Darshan, G. Chechik, and R. Ben-Ari. Omnimattezero: Fast training-free omnimatte with pre-trained video diffusion models. In SIGGRAPH Asia 2025 Conference Papers, 2025.

  [61] Y. Shi, Y. Liu, Y. Wu, X. Liu, C. Zhao, J. Luo, and B. Zhou. Drive any mesh: 4d latent diffusion for mesh deformation from video. arXiv preprint arXiv:2506.07489, 2025.

  [62] A. Shrivastava, S. Mehta, D. Geng, and A. Owens. Point prompting: Counterfactual tracking with video diffusion models. In International Conference on Learning Representations, 2026.

  [63] Y. Siddiqui, T. Monnier, F. Kokkinos, M. Kariya, Y. Kleiman, E. Garreau, O. Gafni, N. Neverova, A. Vedaldi, R. Shapovalov, et al. Meta 3d assetgen: Text-to-mesh generation with high-quality geometry, texture, and pbr materials. Advances in Neural Information Processing Systems, 37:9532–9564, 2024.

  [64] C. Song, J. Zhang, X. Li, F. Yang, Y. Chen, Z. Xu, J. H. Liew, X. Guo, F. Liu, J. Feng, et al. Magicarticulate: Make your 3d models articulation-ready. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15998–16007, 2025.

  [65] O. Sorkine and M. Alexa. As-rigid-as-possible surface modeling. In Proceedings of the Fifth Eurographics Symposium on Geometry Processing, 2007.

  [66] E. Sucar, E. Insafutdinov, Z. Lai, and A. Vedaldi. V-DPM: 4d video reconstruction with dynamic point maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

  [67] R. W. Sumner, J. Schmid, and M. Pauly. Embedded deformation for shape manipulation. In ACM siggraph 2007 papers. 2007.

  [68] J. Tang, Z. Chen, X. Chen, T. Wang, G. Zeng, and Z. Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In European Conference on Computer Vision, pages 1–18. Springer, 2024.

  [69] L. Tang, M. Jia, Q. Wang, C. P. Phoo, and B. Hariharan. Emergent correspondence from image diffusion. Advances in neural information processing systems, 36:1363–1389, 2023.

  [70] T. H. Team. Hunyuan3d 2.5: Towards high-fidelity 3d assets generation with ultimate details, 2025.

  [71] G. Terzakis and M. Lourakis. A consistently fast and globally optimal solution to the perspective-n-point problem. In European Conference on Computer Vision, pages 478–494. Springer, 2020.

  [72] Y. Tewel, O. Kaduri, R. Gal, Y. Kasten, L. Wolf, G. Chechik, and Y. Atzmon. Training-free consistent text-to-image generation. ACM Transactions on Graphics (TOG), 43(4):1–18, 2024.

  [73] N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1921–1930, June 2023.

  [74] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

  [75] Q. Wang, Y.-Y. Chang, R. Cai, Z. Li, B. Hariharan, A. Holynski, and N. Snavely. Tracking everything everywhere all at once. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19795–19806, 2023.

  [76] Q. Wang, Y. Zhang, A. Holynski, A. A. Efros, and A. Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025.

  [77] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024.

  [78] Y. Wang, X. Wang, Z. Chen, Z. Wang, F. Sun, and J. Zhu. Vidu4d: Single generated video to high-fidelity 4d reconstruction with dynamic gaussian surfels. Advances in Neural Information Processing Systems, 37:131316–131343, 2024.

  [79] R. Wu, R. Gao, B. Poole, A. Trevithick, C. Zheng, J. T. Barron, and A. Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26057–26068, 2025.

  [80] Z. Wu, C. Yu, Y. Jiang, C. Cao, F. Wang, and X. Bai. Sc4d: Sparse-controlled video-to-4d generation and motion transfer. In European Conference on Computer Vision, pages 361–379. Springer, 2024.

Showing first 80 references.