{"total":12,"items":[{"citing_arxiv_id":"2605.12162","ref_index":60,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction","primary_cat":"cs.RO","submitted_at":"2026-05-12T14:13:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"methods have integrated explicit 3D encoders directly into Vision-Language-Action (VLA) architectures [29,33,34,46,55] and utilized 3D perception capability of visual foundation models [24,36,51,73]. For example, GeoVLA [55] and PointVLA [33] introduce a point encoder alongside the VLM, fusing them at the action head without disrupting the VLM backbone. Base on a frozen geometry-aware vision transformer VGGT [60], GLaD [24] and Evo-0 [36] extract geometric features from RGB images, thereby enhancing the spatial understanding of VLA. These methods have enhanced spatial awareness by dedicated architectural designs which are orthogonal approaches to our X-Imitator framework. 
Flow Representations. Recognizing the inherently dynamic nature of manipulation, another line of research employs flow as an intermediate representation"},{"citing_arxiv_id":"2605.01799","ref_index":50,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Embody4D: A Generalist Data Engine for Embodied 4D World Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-03T09:39:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00345","ref_index":45,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Pose-Aware Diffusion for 3D Generation","primary_cat":"cs.CV","submitted_at":"2026-05-01T02:05:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26454","ref_index":39,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation","primary_cat":"cs.CV","submitted_at":"2026-04-29T09:11:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Layer analysis of DINOv3 shows non-uniform 3D geometric knowledge concentrated in deeper layers, enabling a last-layer-centric recombination module that improves monocular depth estimation accuracy to state-of-the-art 
levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22686","ref_index":53,"ref_count":3,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SS3D: End2End Self-Supervised 3D from Web Videos","primary_cat":"cs.CV","submitted_at":"2026-04-24T16:12:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior self-supervised baselines.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"riculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (∼100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code. arXiv:2604.22686v1 [cs.CV] 24 Apr 2026 2 M. Hariat et al. 1 Introduction Recent progress in AI such as VGGT [53] and MapAnything [23] show that a single network can infer multiple geometric quantities from monocular video, including depth, camera motion, and camera intrinsics, without running a classical SfM pipeline [47], photometric stereo [21], or shape-from-shading [40,52]. 
However, despite their effectiveness, current 3D-centric foundation models share"},{"citing_arxiv_id":"2604.17565","ref_index":70,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models","primary_cat":"cs.CV","submitted_at":"2026-04-19T18:11:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"generalization to unseen camera trajectories. Inspired by prior work [33,84,86], we introduce point cloud sequences as geometric guidance, supplying the video generation model with explicit geometric priors (as shown in Fig. 2(a)). During training, given an input video V = [I_0, ..., I_{N−1}] ∈ R^{N×3×H×W}, where N denotes the number of frames, we first employ VGGT [70] to estimate the camera pose of each frame, yielding the camera trajectory C = {C_0, ..., C_{N−1}}. Meanwhile, a point cloud P_0 is reconstructed from the first frame I_0 using the pre-trained VGGT model [70]. We then move the virtual camera along the estimated camera trajectory C = {C_0, ..., C_{N−1}} and render the point cloud to obtain a sequence of renderings:"},{"citing_arxiv_id":"2604.14302","ref_index":47,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Geometrically Consistent Multi-View Scene Generation from Freehand Sketches","primary_cat":"cs.CV","submitted_at":"2026-04-15T18:00:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"decouples spatial and temporal attention for scene reconstruction. ViewCrafter [54] guides video diffusion with monocular depth-based point clouds, and GeoVideo [2] aligns predicted depth maps across frames during training. In all these cases, geometric reasoning is an emergent property of the architecture rather than an explicitly supervised objective. In the reconstruction domain, DUSt3R [48], MASt3R [23], and VGGT [47] learn to predict dense correspondences from multi-view images, but treat them as the end product rather than as supervision for generation. CorrespondentDream [18] extracts correspondences from a frozen diffusion UNet to regularise NeRF optimisation, yet does not train the generative model itself with correspondence supervision. 
The concurrent CAMEO [21] is closest to our work, supervising attention"},{"citing_arxiv_id":"2604.08500","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Novel View Synthesis as Video Completion","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:44:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"based downsampling, pixel unshuffle preserves per-pixel ray information by rearranging spatial details into channels without loss, maintaining precise geometric correspondence with VAE latents and retaining fine-grained camera geometry. Much prior work defined cameras (and rays) with respect to a world coordinate system aligned to the first input view [40,42], but this introduces a dependence on the ordering of input views. Instead, we define the world coordinate system with respect to the query view, preserving invariance (to permutations of input views). 
Finally, we define the scale of the world coordinate system by normalizing all input cameras to have a mean distance of unit length (as in past work [33])."},{"citing_arxiv_id":"2604.05182","ref_index":61,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows","primary_cat":"cs.CV","submitted_at":"2026-04-06T21:21:12+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Keywords: Object-centric feed-forward reconstruction · Sparse attention · 3D foundation model 1 Introduction Recent years have witnessed rapid progress in the application of 3D foundation models, typically built on large-scale transformer architectures [60], to tackle 3D tasks previously considered intractable. These tasks include joint estimation of geometry and camera parameters [33,47,61,62], dynamic scene reconstruction [44,46,49,69,78], and sparse-view reconstruction [22,26,63,79,84] and inverse rendering [41,75]. In object-centric reconstruction and inverse rendering arXiv:2604.05182v1 [cs.CV] 6 Apr 2026 2 Z. Li et al. 
Fig. 1: High-fidelity 3D reconstruction. Given 12-18 images (left), LSRM adapts Native Sparse Attention (NSA) to generate explicit meshes and textures in a single"},{"citing_arxiv_id":"2603.24577","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-03-25T17:53:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EndoVGGT uses a dynamic DeGAT graph attention module to improve depth estimation and non-rigid 3D reconstruction in surgery, reporting 24.6% PSNR and 9.1% SSIM gains on SCARED with zero-shot generalization to new domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.17980","ref_index":53,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding","primary_cat":"cs.CV","submitted_at":"2026-03-18T17:42:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at higher speed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.08831","ref_index":83,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"3AM: 3egment Anything with Geometric Consistency in Videos","primary_cat":"cs.CV","submitted_at":"2026-01-13T18:59:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"3AM integrates MUSt3R 3D features into 
SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"cause cross-view consistency is not enforced, leading to view-inconsistent masks and fragmented instances [31,58,73,94,116]. Our approach instead imposes 3D-aware self-consistency directly on video, preserving identity and geometry without 3D annotations or point-cloud mask merging. End-to-End 3D-Aware Methods. Recent end-to-end 3D reconstruction models directly infer geometric structure from 2D inputs [7,9,11,21,39,44,48,67,70,81,83,85,86,89,99,106]. The DUSt3R/MASt3R family has been extended to dynamic scenes [115], incremental reconstruction with spatial memory [82], feed-forward Gaussian splatting [68], efficient multi-view fusion [74], real-time dense SLAM [53], unconstrained structure-from-motion [19], and training-free 4D reconstruction [10]. Feed-forward models now also jointly predict geometry"}],"limit":50,"offset":0}