{"total":19,"items":[{"citing_arxiv_id":"2607.01050","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GeoSearcher: Anchor-Guided Progressive Reasoning for Remote Sensing Visual Grounding with Process Supervision","primary_cat":"cs.CV","submitted_at":"2026-07-01T15:12:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GeoSearcher introduces anchor-centric reasoning supervised fine-tuning and process-faithful group relative policy optimization to improve MLLM-based remote sensing visual grounding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28266","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RSICCLLM: A Multimodal Large Language Model for Remote Sensing Image Change Captioning","primary_cat":"cs.CV","submitted_at":"2026-06-26T16:57:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RSICCLLM introduces a post-training framework with RSICI dataset, difference-aware supervised fine-tuning, and dual-negative preference optimization that claims to outperform much larger models on remote sensing image change captioning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25437","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Does Seeing More Mean Knowing More? 
Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-25T05:29:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MARS introduces mono-anchored advantage normalization to quantify information gain from multi-source integration in RLVR, yielding 3.2% and 4.9% gains on GRPO and DAPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18641","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Leveraging Latent Visual Reasoning in Silence","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:46:02+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Latent visual reasoning improves multimodal models via training effects even without using latent tokens at inference, enabled by an attention-based RL reward that promotes interaction with text tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17486","ref_index":145,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization","primary_cat":"cs.RO","submitted_at":"2026-05-17T14:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15951","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Failure to Feedback: Group Revision Unlocks Hard 
Cases in Object-Level Grounding","primary_cat":"cs.CV","submitted_at":"2026-05-15T13:41:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13156","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dual-Pathway Circuits of Object Hallucination in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-13T08:20:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"show that the identified circuits are consistent across architectures, support causal intervention, and transfer selectively across hallucination types. 1 Introduction Vision-Language Models (VLMs) demonstrate strong visual reasoning [36] and cross-modal understanding [6, 26]. Representative systems such as Qwen-VL [1, 2] and GPT-4 [22] consequently deliver strong performance on image captioning [38], visual grounding [3], and video understanding [37]. However, VLMs often produce object hallucinations, describing entities, attributes, or relations absent from the input image. Such outputs reflect over-reliance on text-centric priors rather than visual evidence, limiting model reliability and interpretability. 
Among hallucination types, object-existence questions provide a particularly clean setting: their binary ground truth directly tests"},{"citing_arxiv_id":"2605.12497","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Web to Pixels: Bringing Agentic Search into Visual Perception","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"60 2.44 0.47 Gemini-3.1-Flash-Lite 53.57 63.89 26.05 23.38 30.94 22.78 15.08 18.00 37.71 37.60 40.33 41.56 27.90 28.42 Gemini-3.1-Pro 62.89 75.00 32.13 33.77 24.14 22.78 22.22 29.60 31.64 32.00 45.43 53.25 30.52 35.09 Open-Source Grounding Models Perception-R1 [4] 16.07 11.11 9.61 6.58 22.16 21.52 4.08 4.00 23.50 22.40 13.15 12.99 14.76 13.10 UniVG-R1 [37] 29.29 33.33 19.02 22.37 31.45 31.65 14.63 19.92 34.68 37.60 25.67 28.57 25.79 28.91 Ground-R1 [38] 44.47 50.00 24.11 28.95 24.23 24.05 9.60 13.67 32.41 32.80 25.23 28.57 26.68 29.67 Open-Source General Models OneThinker-8B [39] 51.1961.11 28.19 32.8923.86 24.05 21.81 32.8137.71 42.4033.92 38.96 32.78 38.70 InternVL-3.5-8B [40] 10.58 2.78 3.49 0.00 3."},{"citing_arxiv_id":"2605.23954","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EchoDistill:Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-11T06:30:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EchoDistill applies noisy-to-clean self-distillation with GRPO to boost Audio LLM robustness, reporting 4.18% average GSR gains 
under strong noise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07334","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-08T06:39:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Deepseek-R1 [9] introduces group relative policy optimization (GRPO), which leverages verifiable reward signals to estimate relative advantages among responses, thereby substantially improving reasoning. Building on this, GRPO fine-tuning techniques have been extended to multimodal tasks, covering areas such as image spatial reasoning [16, 23], video understanding [6, 12, 24], multi-image localization [2, 31], and visual generation [5, 28, 29], demonstrating its strong adaptability in complex multimodal scenarios. Early work, such as Seg-Zero [14] and VisionReasoner [15], designed task-specific rewards for segmentation, improving image-level reasoning and mask quality. 
Building on this idea, Omni-R1 [34] and Veason-R1 [7] adopted GRPO for Ref-AVS and VRS."},{"citing_arxiv_id":"2605.02735","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs","primary_cat":"cs.LG","submitted_at":"2026-05-04T15:36:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18484","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments","primary_cat":"cs.CV","submitted_at":"2026-04-20T16:37:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial reasoning and embodied performance on 18 benchmarks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"RMSE SURDSVLADBench STRIDE-QA Proprietary Models GPT-4o [73] 52.70 19.20 13.31 56.00 14.70 Gemini-1.5 [91] 53.50 19.62 32.77 54.23 - Open-Source Models Qwen2.5-VL-7B [5] 41.10 30.36 12.61 50.75 1.41 Qwen2.5-VL-32B [5] 57.30 15.87 38.82 61.33 6.58 Qwen3-VL-A3B-30B [4] 53.15 13.84 38.69 59.81 8.69 Qwen3-VL-32B [4]59.9315.70 41.53 61.64 6.00 Spatial Models UniVG-R1-7B [6] 46.82 53.85 1.96 46.51 4.61 PR1-OCR-2B [110] 38.88 11.60 13.81 44.65 8.18 
PR1-Detection-3B [110] 36.74 62.61 33.84 56.08 4.35 PR1-Counting-2B [110] 40.32 51.95 13.69 48.52 1.94 PR1-Grounding-2B [110] 39.54 12.34 13.96 43.67 7.15 Embodied Models DriveMM [39] 49.73 12.68 8.73 47.45 1.12 Cosmos-R1 [2] 45.62 23.41 10.05 55.93 1.03 Mimo-Embodied [31] 53."},{"citing_arxiv_id":"2604.18665","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track","primary_cat":"cs.SD","submitted_at":"2026-04-20T14:40:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-VOS into sequential verification and refinement steps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22836","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method","primary_cat":"cs.CV","submitted_at":"2026-04-20T14:36:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"An agent-augmented Sa2VA pipeline for referring video object segmentation placed third in the MeViS-Text track of the 5th PVUW Challenge by adding verification, search, and refinement stages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17488","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AutoVQA-G: Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding 
Annotation","primary_cat":"cs.CV","submitted_at":"2026-04-19T15:22:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AutoVQA-G is a self-improving framework that generates VQA-G datasets with higher visual grounding accuracy than leading multimodal LLMs via iterative CoT verification and prompt refinement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08879","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification","primary_cat":"cs.CL","submitted_at":"2026-04-10T02:38:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRASP improves multimodal sarcasm target identification by anchoring visual regions in grounded chain-of-thought reasoning and using dual-stage optimization on a new balanced dataset.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Furthermore, CoT reasoning [34] has been introduced to facilitate structured textual deductions. Recently, Su et al. [27] proposed the \"Thinking with Images\" paradigm, which enables models to utilize visual information as intermediate reasoning steps, thereby transforming vision from a passive input into a dynamic, actionable cognitive workspace [3, 12]. Departing from previous works that evaluate them in constrained zero-shot settings [14, 33] or merely utilize MLLMs as external knowledge generators, we propose GRASP. It performs end-to-end training on MLLMs using our curated MSTI-MAX dataset, tailored for MSD and MSTI. 
2.2 Multimodal Sarcasm Target Identification Multimodal Sarcasm Detection (MSD) [25] aims to determine whether"},{"citing_arxiv_id":"2512.16918","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AdaTooler-V: Adaptive Tool-Use for Images and Videos","primary_cat":"cs.CV","submitted_at":"2025-12-18T18:59:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.03963","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2025-12-03T16:57:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TempR1 applies temporal-aware multi-task RL using GRPO and three types of localization rewards to achieve SOTA temporal understanding in MLLMs with synergistic gains from joint optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.18719","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2025-05-24T14:42:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like 
π₀-FAST.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}