{"total":16,"items":[{"citing_arxiv_id":"2605.21652","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming","primary_cat":"cs.CV","submitted_at":"2026-05-20T19:06:34+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20165","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:50:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16054","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs","primary_cat":"cs.CV","submitted_at":"2026-04-17T13:29:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Mind's Eye benchmark shows top multimodal LLMs score below 50% on visual abstraction, relation, and transformation tasks while humans reach 80%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12896","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception 
Programs","primary_cat":"cs.CV","submitted_at":"2026-04-14T15:45:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Perception Programs rewrite dense visual tool outputs into language-native summaries, boosting MLLM accuracy by 15-45% absolute on BLINK perception tasks and setting new state-of-the-art results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11025","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images","primary_cat":"cs.CV","submitted_at":"2026-04-13T05:49:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution and multimodal reasoning tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This direction has improved performance on tasks that require localized evidence, including small-text recognition, chart and document understanding, dense scene interpretation, and fine-grained question answering, and has also inspired related efforts on iterative grounding, external perception modules, and agent-style multimodal tool use [8, 17, 21, 26, 35]. Nevertheless, current approaches often focus on teaching models when and how to invoke tools, improving grounding accuracy on familiar distributions, or imitating successful tool-use behaviors, while remaining brittle when the correct evidence cannot be reliably localized from coarse initial perception [6, 11, 12, 38]. 
As a result,"},{"citing_arxiv_id":"2604.06079","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-04-07T16:58:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SciTikZer-8B uses a new dataset, benchmark, and dual self-consistency RL to generate TikZ code for scientific graphics, outperforming much larger models like Gemini-2.5-Pro.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RL with verifiable feedback improves complex reasoning [29], leveraging algorithms like Group Relative Policy Optimization (GRPO) [38] specifically for code [17] and math [52]. Such feedback extends to render-and-compare supervision in RRVF [10] and VisionR1 [21]. Related efforts leverage RL for visual reasoning: Visual Sketchpad [20] introduces visual chain-of-thought, GRIT [16] applies GRPO for grounded reasoning, and OpenThinkIMG [41] explores agentic policies. Closely related, RLRF [35] optimizes SVG generation via visual rewards. Unlike RLRF's focus on SVGs, we target TikZ through DSC RL to boost both structural and visual alignment. 
"},{"citing_arxiv_id":"2604.04733","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Discovering Failure Modes in Vision-Language Models using RL","primary_cat":"cs.CV","submitted_at":"2026-04-06T15:00:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An RL-based questioner agent adaptively generates queries to discover novel failure modes in VLMs without human intervention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02794","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CharTool: Tool-Integrated Visual Reasoning for Chart Understanding","primary_cat":"cs.AI","submitted_at":"2026-04-03T07:02:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.12623","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space","primary_cat":"cs.CV","submitted_at":"2025-12-14T10:07:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.03438","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents","primary_cat":"cs.AI","submitted_at":"2025-12-03T04:42:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Argos is an agentic verifier that adaptively picks scoring functions to evaluate accuracy, localization, and reasoning quality, enabling stronger multimodal RL training for AI agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.05271","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DeepEyesV2: Toward Agentic Multimodal Model","primary_cat":"cs.CV","submitted_at":"2025-11-07T14:31:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.04225","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images","primary_cat":"cs.CV","submitted_at":"2025-10-05T14:29:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Locate-Then-Examine improves AI-generated image detection by localizing suspicious regions first then performing region-aware re-examination, while releasing the TRACE dataset of 20k annotated 
images.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22746","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning","primary_cat":"cs.AI","submitted_at":"2025-09-26T04:33:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gains on visual reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.20912","ref_index":6,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning","primary_cat":"cs.AI","submitted_at":"2025-09-25T08:58:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeFacto trains multimodal models with counterfactual image variants and GRPO reinforcement learning to enforce that correct answers are supported by correct visual evidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.08827","ref_index":133,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey of Reinforcement Learning for Large Reasoning Models","primary_cat":"cs.CL","submitted_at":"2025-09-10T17:59:43+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large 
language models and large reasoning models since DeepSeek-R1.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.02547","ref_index":246,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","primary_cat":"cs.AI","submitted_at":"2025-09-02T17:46:26+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Inspired by this, researchers have begun equipping LVLMs with the ability to generate sketches or images interleaved with chain-of-thought (CoT) reasoning, enabling models to externalize intermediate representations and reason more effectively [243, 244, 245]. Visual Planning [243] proposes to use imagined image rollouts only as the CoT images thinking, using downstream task success as the reward signal. GoT-R1 [246] applies RL within the Generation-CoT framework, allowing models to autonomously discover semantic-spatial reasoning plans before producing the image. Similarly, T2I-R1 [247] explicitly decouples the process into a semantic-level CoT for high-level planning and a token-level CoT for patch-wise pixel generation, jointly optimizing both stages with RL."}],"limit":50,"offset":0}