{"total":19,"items":[{"citing_arxiv_id":"2606.01247","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?","primary_cat":"cs.CV","submitted_at":"2026-05-31T14:00:10+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23898","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SPACENUM: Revisiting Spatial Numerical Understanding in VLMs","primary_cat":"cs.AI","submitted_at":"2026-05-22T17:58:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLMs fail to ground numerical values in spatial perception on new bidirectional tasks, relying on shallow cues instead of coordinate-aware representations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16713","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GeoWorld-VLM: Geometry from World Models for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-15T23:52:11+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12449","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LychSim: A Controllable and Interactive Simulation Framework for 
Vision Research","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:40:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LychSim introduces a controllable simulation platform on Unreal Engine 5 with Python API, procedural generation, and LLM integration for vision research tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"PerceptualTaxonomy [24] required the model to infer task-relevant properties from 3D scenes and enable goal-directed reasoning. For model training. LychSim can also serve as a highly scalable synthetic data framework for generating post-training data that enhance various 2D and 3D spatial understanding abilities of vision-language models. Prior successes in this area, including SAT [42], ScanForgeQA [61], and SIMS-V [2], demonstrate that scalable, high-fidelity simulation can be effectively integrated into the post-training loop and substantially improve spatial understanding performance. 4.2. Adversarial Examiners Standard datasets are often limited to a narrow subset of the broader real-world parameter space. 
This restriction introduces bias in evaluation, such as in terms of object appearance and shape [63] or"},{"citing_arxiv_id":"2605.02881","ref_index":30,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MolmoAct2: Action Reasoning Models for Real-world Deployment","primary_cat":"cs.RO","submitted_at":"2026-05-04T17:51:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26934","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-04-29T17:48:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Distilling view-consistent future views and action-outcome supervision from a generative world model into a VLM via two-stage post-training improves dynamic spatial reasoning on SAT-Real, VSI-Bench and similar benchmarks while avoiding test-time world-model cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20570","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Exploring Spatial Intelligence from a Generative Perspective","primary_cat":"cs.CV","submitted_at":"2026-04-22T13:50:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Fine-tuning multimodal models on a new synthetic spatial benchmark improves 
generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"provides a systematic assessment across multiple spatial reasoning dimensions, including dynamic reasoning, spatial interaction, and perspective taking. On the methodology side, several works [15, 41, 42] aim to enhance spatial understanding in MLLMs. Spatial-MLLM [34] introduces an auxiliary spatial encoder to explicitly inject 3D geometric information into the model. SAT [24] leverages simulation environments to generate large-scale rule-based spatial reasoning data for training (real-world evaluation: SAT-Real). REVISION [3] demonstrates that data from simulated rendering engines (e.g., Blender) can benefit both image generation and spatial understanding when used as additional guidance. Despite these advances, prior work has not ex-"},{"citing_arxiv_id":"2604.18484","ref_index":83,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments","primary_cat":"cs.CV","submitted_at":"2026-04-20T16:37:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"derstanding and physical reasoning remain bottlenecks [16,83,105]. Research advances along two fronts: enhancing 3D perception (depth, layout, relations) and strengthening physical grounded inference. 
Early works introduced structured 3D abstractions [28] and benchmarks like SpatialVLM [12]; subsequent efforts extended inputs to RGB-D/3D scene graphs [16], used simulation for synthesis [83], and employed tool-assisted geometric estimation [10, 78]. Recent scaling approaches leverage larger datasets for broader coverage [60,105,107,119]. For reasoning, methods embed multimodal representations into chains of thought [53], construct cognitive maps [74,108], predict 3D intermediate outputs [65], integrate tools for refinement [78,101], and use RL to optimize patterns [36,62,75,90]."},{"citing_arxiv_id":"2603.27437","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-03-28T22:49:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.03944","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SCP: Spatial Causal Prediction in Video","primary_cat":"cs.CV","submitted_at":"2026-03-04T11:09:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.11635","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Do MLLMs Really Understand Space? 
A Mathematical Reasoning Evaluation","primary_cat":"cs.AI","submitted_at":"2026-02-12T06:37:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MLLMs show a large gap in spatial mathematical reasoning compared to humans, and a new 10,000-problem dataset helps narrow it through training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.04415","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training","primary_cat":"cs.CL","submitted_at":"2026-02-04T04:13:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.10941","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mull-Tokens: Modality-Agnostic Latent Thinking","primary_cat":"cs.CV","submitted_at":"2025-12-11T18:59:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.16518","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiMo-Embodied: X-Embodied Foundation Model Technical 
Report","primary_cat":"cs.RO","submitted_at":"2025-11-20T16:34:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.13998","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2025-08-19T16:50:01+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.23678","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Grounded Reinforcement Learning for Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2025-05-29T17:20:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.17015","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models","primary_cat":"cs.CV","submitted_at":"2025-05-22T17:59:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multi-SpatialMLLM integrates depth perception, visual correspondence, and dynamic perception into MLLMs via a 27M-sample MultiSPA dataset and benchmark, yielding gains on multi-frame spatial tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.05132","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"R1-Zero's \"Aha Moment\" in Visual Reasoning on a 2B Non-SFT Model","primary_cat":"cs.AI","submitted_at":"2025-03-07T04:21:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RL on Qwen2-VL-2B with SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming base model by ~30% and SFT by ~2%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.07542","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Imagine while Reasoning in Space: Multimodal Visualization-of-Thought","primary_cat":"cs.CL","submitted_at":"2025-01-13T18:23:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MVoT lets multimodal models create coherent images during chain-of-thought reasoning via a token discrepancy loss, yielding competitive or better results than text-only CoT on dynamic spatial 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}