{"total":13,"items":[{"citing_arxiv_id":"2606.27345","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation","primary_cat":"cs.CV","submitted_at":"2026-06-25T17:51:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RayPE extends video DiT attention with Plucker coordinates and a gated reciprocal-product term to improve 3D consistency and camera controllability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30060","ref_index":93,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Consistent Video Geometry Estimation","primary_cat":"cs.CV","submitted_at":"2026-05-28T15:11:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ViGeo is a feed-forward transformer for video geometry that introduces dynamic chunking attention and a completion-based data refinement framework to achieve SOTA on depth, normals, and point map estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27367","ref_index":141,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpatialBench: Is Your Spatial Foundation Model an All-Round Player?","primary_cat":"cs.CV","submitted_at":"2026-05-26T17:59:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline 
model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26519","ref_index":80,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"$R^3$: 3D Reconstruction via Relative Regression","primary_cat":"cs.CV","submitted_at":"2026-05-26T04:03:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"R³ uses relative regression with confidence-weighted constraints from an MLP to support long-context offline and streaming 3D reconstruction without global coordinate assumptions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23903","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Geo-Align: Video Generation Alignment via Metric Geometry Reward","primary_cat":"cs.CV","submitted_at":"2026-05-22T17:59:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23889","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-05-22T17:50:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization 
from 48-frame training to over 10,000-frame test","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15178","ref_index":89,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:58:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher throughput than prior open baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12774","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WildPose: A Unified Framework for Robust Pose Estimation in the Wild","primary_cat":"cs.CV","submitted_at":"2026-05-12T21:39:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"WildPose unifies feedforward 3D features from MASt3R with differentiable bundle adjustment for robust monocular pose estimation across dynamic, static, and low-ego-motion scenes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01896","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models","primary_cat":"cs.CV","submitted_at":"2026-05-03T14:22:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"M²-REPA decouples modality-specific features inside a diffusion model and aligns each 
to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and long-term consistency in multi-modal video generation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"dimensional, deploying world simulation and its downstream applications often necessitates information from multiple modalities. Consequently, a growing body of research has explored multi-modal world models [4,6,17,26,61,64], which generate videos beyond just RGB. For instance, TesserAct [61] learns a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. Aether [64] develops an RGB-Depth model and leverages both modalities for reconstruction and planning. WorldWeaver [26] jointly models RGB, depth, and optical flow to enhance long video generation. While our work shares the goal of multi-modal video world modeling, it distinguishes itself by distilling strong priors from modality-specific foundation models into video representations, simultaneously"},{"citing_arxiv_id":"2605.01799","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Embody4D: A Generalist Data Engine for Embodied 4D World Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-03T09:39:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(7) The resulting u t′ serves as the geometric anchor for aligning the robotic arm in the new perspective. 4.2 Confidence-aware Adaptive Noise Injection Strategy Most existing 4D generation methods overlook the inherent spatial variations in reliability when applying priors warped from reconstructed dynamic point clouds (from source to target views) [36,58,62]. 
In low-confidence regions, such as occlusions and boundaries, over-constraining the model with inaccurate warp priors encodes flawed spatial mappings, creating persistent artifacts that compromise the restorative capacity of the generation process. Conversely, in high-confidence visible regions, the uniform sampling process of standard diffusion models intro-"},{"citing_arxiv_id":"2604.08995","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory","primary_cat":"cs.CV","submitted_at":"2026-04-10T06:00:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive distillation on a 5B model.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"synthetic pipeline with tick-synchronized video from navigation-mesh-based exploration, stochastic camera control, and a combinatorial character assembly system yielding over 108 variants; (ii) a scalable four-layer decoupled recording architecture that automates capture from multiple AAA titles at terabyte scale; and (iii) diverse real-world corpora (DL3DV-10K [25], RealEstate10K [58], OmniWorld [59], and SpatialVid [42]) spanning indoor, urban, aerial, and vehicular scenes. Together they produce high-quality annotated video data at industrial scale, directly addressing the supervision bottleneck for long interactive rollouts. 
Modeling. Bridging strong bidirectional video priors with a streaming inference paradigm introduces inherent trade-offs among long-horizon memory, coherence, controllability, and error"},{"citing_arxiv_id":"2604.08532","ref_index":82,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Improving 4D Perception via Self-Distillation","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:59:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.10647","ref_index":129,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Depth Anything 3: Recovering the Visual Space from Any Views","primary_cat":"cs.CV","submitted_at":"2025-11-13T18:59:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"the precision and recall of the reconstruction R with respect to a distance threshold d. Precision is given by $\\frac{1}{|R|}\\sum_{i}\\left[\\operatorname{dist}(R_i \\to G) < d\\right]$, and recall by $\\frac{1}{|G|}\\sum_{i}\\left[\\operatorname{dist}(G_i \\to R) < d\\right]$, where [·] denotes the Iverson bracket [46]. To jointly capture both measures, we report the F1-score, computed as $F_1 = \\frac{2 \\times \\text{precision} \\times \\text{recall}}{\\text{precision} + \\text{recall}}$. 
6.3 Datasets Our benchmark is built on five datasets: HiRoom [129], ETH3D [72], DTU [1], 7Scenes [74], and ScanNet++ [117]. Together, they cover diverse scenarios ranging from object-centric captures to complex indoor and outdoor environments, and are widely adopted in prior research. Below, we present more details about the dataset preparation process. HiRoom is a Blender-rendered synthetic dataset comprising 30 indoor living scenes created by professional"}],"limit":50,"offset":0}