{"total":13,"items":[{"citing_arxiv_id":"2607.00881","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniView-Space: Reinforcing Spatial Reasoning via Multi-Perspective Spatial Mapping","primary_cat":"cs.CV","submitted_at":"2026-07-01T12:45:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OmniView-Space framework with MPSM, tool-guided reasoning, and distillation achieves SOTA on spatial reasoning benchmarks for MLLMs while reducing external geometry dependencies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31257","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-30T07:33:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding, with horizontal grounded, vertical prior, and depth inverted.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25802","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking VLM Representation for VLA Initialization","primary_cat":"cs.CV","submitted_at":"2026-05-25T12:51:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Experiments indicate original VLM representations are crucial for VLA performance, LoRA outperforms full finetuning, and staged robot-data pretraining yields the strongest 
initialization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23897","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ETCHR: Editing To Clarify and Harness Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-22T17:58:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22819","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cambrian-P: Pose-Grounded Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-21T17:59:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"To bridge this gap, several studies [11, 20, 29, 110, 111] curate spatial-oriented data by repurposing existing 3D-related datasets [ 4, 10, 26, 75, 116], applying pseudo-labeling, or designing synthetic data generation pipelines [28]. These efforts not only improve models' spatial understanding but also provide foundational datasets for future exploration. [58, 69, 110] propose to finetune MLLMs on spatial data using reinforcement learning to improve their spatial reasoning capability. 
Another line of research [ 29, 48, 128] introduces 3D features from off-the-shelf 3D encoders [96]. While this significantly improves MLLMs' spatial awareness, the approach remains inflexible as it is largely constrained by the quality of the pre-trained features."},{"citing_arxiv_id":"2605.22558","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-21T14:40:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GeoWeaver performs token-adaptive geometric grounding on visual tokens from a multi-level bank prior to language modeling to support better spatio-temporal reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20705","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-22T15:46:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"centric self-supervised tasks for MLLM post-training. 2. 
Related Work Self-Supervised Learning. Self-supervised learning (SSL) derives supervision from raw data by exploiting their in- Table 1. Comparison of key properties with self-supervised RL post-training methods. Unlike recent or concurrent methods such as Jigsaw-R1 [67], Visual Jigsaw [70], SSL4RL [23], and Spatial-SSRL [40], SSL-R1 uniquely satisfies all desired properties: it covers multiple SSL tasks, supports one-time and one-stage training, and is transferable to a broad range of downstream tasks. Desired and undesired properties are shown in green and magenta, respectively. Desired Properties [67] [70] [23] [40] Ours Covers multiple SSL tasks ✗ ✗ ✓ ✓ ✓ Supports one-time and one-stage training ✓ ✓ ✗ ✗ ✓"},{"citing_arxiv_id":"2604.18484","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments","primary_cat":"cs.CV","submitted_at":"2026-04-20T16:37:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and employed tool-assisted geometric estimation [10, 78]. Recent scaling approaches leverage larger datasets for broader coverage [60,105,107,119]. For reasoning, methods embed multimodal representations into chains of thought [53], construct cognitive maps [74,108], predict 3D intermediate outputs [65], integrate tools for refinement [78,101], and use RL to optimize patterns [36,62,75,90]. 
However, these methods target generic tasks, focusing on auxiliary input augmentation or scaled supervision. Our work addresses this gap by tailoring latent geometric and physical understanding to domain-specific driving semantics and scalability across massive real-world logs. 3 Method MLLMs excel at general vision-language tasks but are brittle in embodied en-"},{"citing_arxiv_id":"2604.18320","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations","primary_cat":"cs.CV","submitted_at":"2026-04-20T14:20:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"source models GPT-5 mini/nano [32] and GPT-4o [14]; and larger open-source MLLMs InternVL3-9B [50] and LLaVA-OneVision-72B [18]. Evaluation Setting. We adopt VLMEvalKit [6], a widely used evaluation framework, for all assessments. We evaluate performance across diverse benchmarks covering multiple capabilities: MMStar [3] and MMVet [43] (general VQA); HallusionBench [8] (visual hallucination); MIA-Bench [28] (complex instruction following); VisuLogic [41] and MathVista (testmini subset) [26] (mathematical and logical reasoning); BLINK [7] (visual perception); MuirBench [37] (multi-image understanding). 
In total, the evaluation suite comprises more than 10,000 test samples, comprehensively covering a wide variety of multimodal understanding and reasoning"},{"citing_arxiv_id":"2604.12966","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Boosting Visual Instruction Tuning with Self-Supervised Guidance","primary_cat":"cs.CV","submitted_at":"2026-04-14T16:59:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"lization in MLLMs have also drawn direct inspiration from SSL. For example, masked image modeling has been revisited in this context [70] (as discussed in the previous paragraph), and jigsaw puzzle solving has been adapted as a post-training objective within RLVR frameworks [73, 74], casting patch permutation prediction as a verifiable reward signal for vision-centric supervision. [33, 51] additionally introduce other pretext tasks in a similar framework. In this work, we similarly draw inspiration from SSL V-GIFT5 Fig. 2: Visually grounded instruction-following tasks reformulated from self-supervised learning (SSL) pretext tasks. 
(a) Rotation prediction: the model must recognize object orientations and relate it to canonical poses."},{"citing_arxiv_id":"2604.09167","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-10T09:51:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"grounding in the 3D scene structure, VLMs are prone to hallucinations, generating plausible-sounding responses that are factually disconnected from the physical environment. Thus, bridging the gap between high-level semantic reasoning and low-level geometric grounding is essential. Recent efforts have made strides toward grounded 3D reasoning but face distinct limitations. Reasoning-oriented methods [4,9,28,30,34,47] typically enhance performance through instruction tuning on 3D data. However, they rely heavily on specialized supervision and in-domain adaptation, limiting their generalization to unseen scenes. Alternatively, tool-augmented methods [17,31] leverage external perception modules for evidence collection. 
While promising, they often depend on rigid, predefined reasoning pipelines or static tool configurations."},{"citing_arxiv_id":"2603.27437","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-03-28T22:49:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.16416","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning","primary_cat":"cs.CV","submitted_at":"2025-10-18T09:22:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SSL4RL reformulates self-supervised learning objectives into dense, verifiable reward signals for RL-based fine-tuning of vision-language models, yielding performance gains on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}