{"total":19,"items":[{"citing_arxiv_id":"2606.22131","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Feed-forward Motion In-betweening for Any 4D","primary_cat":"cs.CV","submitted_at":"2026-06-20T16:18:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes a feed-forward keyframe-conditioned in-betweening method for arbitrary 4D meshes using a topology-agnostic VAE and MMDiT-based rectified flow model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30268","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:29:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PhyGenHOI couples a motion diffusion model for humans with material point method simulation for objects on 3D Gaussians, using attraction loss, contact re-simulation, and masked video-SDS to produce physically consistent dynamic interactions from text.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26109","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Helix4D: Complex 4D Mesh Generation","primary_cat":"cs.CV","submitted_at":"2026-05-25T17:59:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Helix4D generates high-quality dynamic 4D meshes from videos by extending Trellis2 with sliding-window cross-frame attention anchored on the first frame and a repurposed 4D temporal encoding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21489","ref_index":61,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Variance Reduction for Expectations with Diffusion Teachers","primary_cat":"cs.LG","submitted_at":"2026-05-20T17:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CARV amortizes upstream diffusion teacher costs over noise resamples with timestep importance sampling and stratified-inverse-CDF sampling, delivering 2-3x effective compute gains in text-to-3D experiments and order-of-magnitude variance cuts in single-step distillation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19786","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fast 4D Mesh Generation by Spatio-Temporal Attention Chains","primary_cat":"cs.CV","submitted_at":"2026-05-19T12:51:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A training-free Spatio-Temporal Attention Chain framework accelerates 4D mesh generation 13x, improves quality, scales to 16x longer videos, and supports downstream tracking and camera estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18010","ref_index":292,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Functionalization via Structure Completion and Motion Rectification","primary_cat":"cs.CV","submitted_at":"2026-05-18T08:05:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13838","ref_index":116,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow","primary_cat":"cs.CV","submitted_at":"2026-05-13T17:58:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13129","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation","primary_cat":"cs.GR","submitted_at":"2026-05-13T07:55:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04527","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Velox: Learning Representations of 4D Geometry and Appearance","primary_cat":"cs.CV","submitted_at":"2026-05-06T06:12:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth simulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), 2021. 7 [69] Davis Rempe, Tolga Birdal, Yongheng Zhao, Zan Gojcic, Srinath Sridhar, and L. Guibas. Caspr: Learning canonical spatiotemporal point cloud representations, 2020. 2 [70] Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, and Ziwei Liu. Dreamgaussian4d: Generative 4d gaussian splatting.arXiv preprint arXiv:2312.17142, 2023. 2, 8 [71] Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xi- aohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim, and Huan Ling. L4gm: Large 4d gaussian reconstruction model."},{"citing_arxiv_id":"2604.26917","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation","primary_cat":"cs.CV","submitted_at":"2026-04-29T17:27:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantically accurate, temporally coherent animations in seconds.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"*Corresponding author: Xiang Bai. A preliminary version of this work was presented at the IEEE/CVF International Conference on Computer Vision (ICCV), 2025 [1] challenges due to the complexity of spatio-temporal distribu- tions and the scarcity of 4D training data. Existing 4D generation approaches generally fall into two categories: per-instance optimization methods [6]-[10] uti- lizing SDS [11], and multi-view dynamic video generation methods [12], [13]. The former suffers from high compu- tational costs and inconsistency, while the latter relies on post-processing for 4D reconstruction, impeding real-time applications. Moreover, these methods, which typically adopt dynamic 3DGS [14] or NeRF [15] as 4D representations,"},{"citing_arxiv_id":"2604.21592","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-04-23T12:18:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sculpt4D generates temporally coherent 4D shapes by integrating a block sparse attention mechanism with time-decaying mask into a pretrained 3D diffusion transformer, achieving SOTA results with 56% less computation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09045","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting","primary_cat":"cs.CV","submitted_at":"2026-04-10T07:07:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A scene-agnostic object codebook learned via unsupervised object-centric learning provides consistent identity-anchored representations for 3D Gaussians across multiple scenes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08746","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AniGen: Unified $S^3$ Fields for Animatable 3D Asset Generation","primary_cat":"cs.GR","submitted_at":"2026-04-09T20:22:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AniGen directly generates animatable 3D assets with consistent shape, skeleton, and skinning from single images using unified S^3 fields and a two-stage flow-matching pipeline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06168","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Action Images: End-to-End Policy Learning via Multiview Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-07T17:59:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"colosseum: A benchmark for evaluating generalization for robotic manipu- lation. arXiv preprint arXiv:2402.08191 (2024) [48] Rajbhandari,S.,Rasley,J.,Ruwase,O.,He,Y.:Zero:Memoryoptimizations toward training trillion parameter models. In: SC20: international confer- ence for high performance computing, networking, storage and analysis. pp. 1-16. IEEE (2020) [49] Ren, J., Pan, L., Tang, J., Zhang, C., Cao, A., Zeng, G., Liu, Z.: Dreamgaus- sian4d: Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142 (2023) [50] Shen, W., Yang, G., Yu, A., Wong, J., Kaelbling, L.P., Isola, P.: Distilled feature fields enable few-shot language-guided manipulation. arXiv preprint arXiv:2308.07931 (2023) [51] Singer, U."},{"citing_arxiv_id":"2602.04876","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation","primary_cat":"cs.CV","submitted_at":"2026-02-04T18:58:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PerpetualWonder introduces a closed-loop generative simulator with a unified physical-visual representation for long-horizon action-conditioned 4D scene generation from one image.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.07435","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation","primary_cat":"cs.CV","submitted_at":"2025-09-09T06:43:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LGAA is a modular adapter framework that lifts multi-view diffusion models to produce 2D Gaussian Splats with PBR channels for high-quality relightable 3D mesh extraction using data-efficient finetuning on 69k instances.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.13109","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models","primary_cat":"cs.CV","submitted_at":"2025-04-17T17:24:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniEdit-Flow presents tuning-free Uni-Inv and Uni-Edit methods for inversion and editing in flow models that achieve accurate reconstruction and robust region-preserving edits across generative models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.09176","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LIVE-GS: LLM Powers Interactive VR Experience with Physics-Aware Gaussian Splatting","primary_cat":"cs.HC","submitted_at":"2024-12-12T11:06:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LIVE-GS uses an LLM to predict physical parameters from static Gaussian assets in 10 seconds for physics-aware VR interactions, validated by interviews, baseline comparisons, and user studies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.03890","ref_index":187,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on 3D Gaussian Splatting","primary_cat":"cs.CV","submitted_at":"2024-01-08T13:42:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey compiling principles, applications, benchmarks, and challenges of 3D Gaussian Splatting for explicit 3D scene representation.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"• Point-based rendering represents another class of rendering algorithms, of which 3D GS introduces a notable implementation. Its simplest form [58] rasterizes point clouds with a fixed size, which introduces drawbacks such as holes and rendering artifacts. Seminal works addressed these limitations through various methods, including:i) splatting point primitives with a spatial extent [187, 316-318], and ii) more recently, embedding neural features directly into points for subsequent network-based rendering [4, 190]. 3D GS uses 3D Gaussian as the point primitive that contains explicit attributes (e.g., color and opacity) instead of implicit neural features. The rendering approach, i.e., point-based 𝛼-blending (exemplified in Eq. 5), shares the same image formation model as NeRF-style volumetric rendering (Eq."}],"limit":50,"offset":0}