{"total":13,"items":[{"citing_arxiv_id":"2605.28490","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-27T13:45:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SSR3D-LLM improves fine-grained 3D grounding in unified 3D-LLMs by generating and scoring sequences of latent spatial reasoning steps from the query using fixed Mask3D proposals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01365","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection","primary_cat":"cs.CV","submitted_at":"2026-05-02T10:21:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VoxAfford fuses multi-scale voxel features into MLLM output tokens using cross-attention with a learned compatibility gate to achieve SOTA open-vocabulary 3D affordance detection with ~8% mIoU gain and zero-shot robot transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21160","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment","primary_cat":"cs.CV","submitted_at":"2026-04-23T00:01:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Geometric Reward Credit Assignment disentangles rewards to geometric tokens and adds reprojection consistency to boost 3D keypoint accuracy from 0.64 to 0.93 and bounding box IoU to 0.686 on a 
ShapeNetCore benchmark while preserving 2D performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05695","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs","primary_cat":"cs.CV","submitted_at":"2026-04-07T10:45:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"we reformulate this task as 3D spatial-temporal video grounding. The model is required to simultaneously localize the target object's bounding box in the camera coordinate system and predict its corresponding frame index. For evaluation, we report the accuracy at intersection-over-union (IoU) thresholds of 0.25 and 0.5 (Acc@0.25 and Acc@0.5). • 3D Dense Captioning (Scan2Cap [12]): This task requires the model to generate descriptive text for various objects within a 3D scene. Following previous conventions, we prompt the model to generate captions conditioned on the object center coordinates. 
To better leverage the injected visual geometric features, all coordinates are uniformly transformed to the coordinate system of the initial frame."},{"citing_arxiv_id":"2604.02689","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs","primary_cat":"cs.CV","submitted_at":"2026-04-03T03:32:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Efficient3D prunes visual tokens in 3D MLLMs via DVTIE and ATR modules, reporting better performance than unpruned baselines on Scan2Cap and other benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Kaipeng Zhang, Fei Chao, Rongrong Ji, Yu Qiao, and Ping Luo. Diffrate: Differentiable compression rate for efficient vision transformers. In ICCV, 2023. 2 [9] Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. In CVPR, 2024. 2 [10] Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Ruiyuan Lyu, Runsen Xu, Dahua Lin, and Jiangmiao Pang. Grounded 3d-llm with referent tokens. arXiv preprint arXiv:2405.10370, 2024. 2 [11] Zhenyu Chen, Ali Gholami, Matthias Nießner, and Angel X Chang. Scan2cap: Context-aware dense captioning in rgb-d scans. In CVPR, 2021. 
1, 6 [12] Hengshuo Chu, Xiang Deng, Qi Lv, Xiaoyang Chen,"},{"citing_arxiv_id":"2604.03318","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs","primary_cat":"cs.CV","submitted_at":"2026-04-01T15:28:13+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tial cognition tasks, recent studies have introduced various strategies to enhance spatial understanding capabilities. 3D prior-based approaches focus on integrating explicit 3D information into MLLMs to improve spatial reasoning and scene comprehension. LL3DA [6] and LEO [17] employ additional 3D branches to incorporate point clouds for enhanced scene-level understanding. Grounded 3D-LLM [7] designs a cross-modal interaction module to improve fine-grained object reasoning in 3D space, while Chat3D [43] and ChatScene [54] utilize 3D detectors or segmentors to extract explicit object features from 3D modalities. 
Beyond point clouds, 3D-LLM [15], Scene-LLM [13], and LLaVA-3D [60] leverage camera parameters to project multi-view 2D features into correspond-"},{"citing_arxiv_id":"2603.27507","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM","primary_cat":"cs.CV","submitted_at":"2026-03-29T04:16:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03296","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"3D-IDE: 3D Implicit Depth Emergent","primary_cat":"cs.CV","submitted_at":"2026-03-28T00:54:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"3D awareness emerges implicitly in MLLMs via self-supervised geometric constraints that create an information bottleneck, removing depth and pose dependencies at inference and cutting latency by 55%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.17980","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding","primary_cat":"cs.CV","submitted_at":"2026-03-18T17:42:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded 
filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at higher speed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.23365","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpatialMosaic: A Multiview VLM Dataset for Partial Visibility","primary_cat":"cs.CV","submitted_at":"2025-12-29T10:48:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpatialMosaic introduces a 2M-pair multi-view QA dataset and 1M-pair benchmark for MLLMs on spatial reasoning under partial visibility, plus a hybrid baseline that integrates 3D reconstruction models as geometry encoders.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.05199","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DEGround: An Effective Baseline for Ego-centric 3D Visual Grounding with a Homogeneous Framework","primary_cat":"cs.CV","submitted_at":"2025-06-05T16:11:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DEGround presents a unified homogeneous framework for 3D visual grounding with shared queries and two plug-in modules for better instruction alignment, reporting a 7.52% improvement on the EmbodiedScan benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.23747","ref_index":44,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial 
Intelligence","primary_cat":"cs.CV","submitted_at":"2025-05-29T17:59:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.06239","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Open-Architecture End-to-End System for Real-World Autonomous Robot Navigation","primary_cat":"cs.RO","submitted_at":"2024-10-08T17:54:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Presents an open ROS2-based end-to-end navigation system for quadruped robots achieving over 88% success in zero-shot real-world indoor navigation tasks using semantic scene graphs and LLM planning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}