{"total":29,"items":[{"citing_arxiv_id":"2607.01707","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LASER: A Corrective Lens for LVLMs via Visual Attention Preservation and Sink Suppression","primary_cat":"cs.CV","submitted_at":"2026-07-02T04:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LASER uses Visual Grounding Reward and Sink Suppression Reward to preserve visual attention trajectories and suppress sink tokens, reducing visual forgetting in LVLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31599","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Token-Sparse Medical Multimodal Reasoning via Dual-Stream Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-06-30T12:47:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViToS uses dual-stream RL with cross-feedback optimization to prune medical image tokens to 77% length while reporting 108.27% and 104.16% relative performance on two 7B VLMs across seven benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28845","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Personalizing MLLMs via Reinforced Multimodal Reference Game","primary_cat":"cs.CV","submitted_at":"2026-06-27T10:12:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RRG trains MLLMs via a reinforced multimodal reference game with contrastive rewards on hard positives and negatives to produce accurate, discriminative concept descriptions, achieving SOTA on personalization 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22158","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Improving Reasoning in Vision-Language Models via Perception Verified Self-Training","primary_cat":"cs.CV","submitted_at":"2026-06-20T17:33:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Perception-verified self-training with PerceptEval and two-stage curriculum learning improves VLM reasoning by up to 16% over standard self-training baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09711","ref_index":284,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization","primary_cat":"cs.AI","submitted_at":"2026-06-08T16:32:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07812","ref_index":119,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Participation in Modular AI Systems","primary_cat":"cs.AI","submitted_at":"2026-06-05T19:39:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Modular AI systems assembled from contributed small models outperform monolithic LLMs by up to 15.4% on 15 tasks including reasoning and factuality while showing emergent problem-solving and benefits from contributor 
diversity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00564","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding","primary_cat":"cs.CV","submitted_at":"2026-05-30T06:34:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Decomposes VLM distillation loss into orthogonal language and visual components and introduces Visual Gradient Steering to prioritize visual grounding over standard monolithic optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30257","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:20:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stable-Layers applies Flow-GRPO with LoRA and a two-stage VLM scoring pipeline to improve layer decomposition without paired supervision, yielding stronger separation and lower reconstruction error on Crello.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28070","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information","primary_cat":"cs.AI","submitted_at":"2026-05-27T07:28:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"JTS trains reasoning models via supervised warm-up and missing-premise RL to make an explicit 
answerability commitment that triggers early termination on unanswerable inputs, raising Abstention@Detection near saturation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21931","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-21T03:00:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EvoVid proposes a temporal-centric self-evolution framework for Video-LLMs that uses temporal-aware Questioner and temporal-grounded Solver rewards to improve performance directly from unannotated videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21924","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Visual-Advantage On-Policy Distillation for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-21T02:48:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20914","ref_index":22,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RISE: Reliable Improvement in Self-Evolving Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T08:57:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RISE proposes a self-evolving VLM framework with three designs to address 
challenges in question generation and solver adaptation, reporting consistent gains on seven benchmarks across two backbones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20165","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:50:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19342","ref_index":8,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Semantic-Enriched Latent Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-19T04:29:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SLVR is a two-stage method that enriches region-centric latent representations with fine-grained attribute semantics and aligns them via M-GRPO across multiple queries on the same region, supported by new SLV-Set dataset and SV-QA benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13467","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-13T12:55:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PDCR improves vision-language reasoning by computing separate 
normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09269","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification","primary_cat":"cs.CL","submitted_at":"2026-05-10T02:32:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tional Linguistics: NAACL 2025, pages 1755-1797, 2025. [21] Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, et al. Vl-rewardbench: a challenging benchmark for vision-language generative reward models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24657-24668, 2025. [22] Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, et al. Self-rewarding vision-language model via reasoning decomposition. arXiv preprint arXiv:2508.19652, 2025. [23] Rui Liu, Dian Yu, Lei Ke, Haolin Liu, Yujun Zhou, Zhenwen Liang, Haitao Mi, Pratap Tokekar, and Dong Yu. 
Stable and efficient single-rollout rl for multimodal reasoning."},{"citing_arxiv_id":"2605.09262","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reinforcing Multimodal Reasoning Against Visual Degradation","primary_cat":"cs.CV","submitted_at":"2026-05-10T02:17:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024. [12] Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data. Advances in neural information processing systems, 33:19884-19895, 2020. [13] Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, et al. Self-rewarding vision-language model via reasoning decomposition. arXiv preprint arXiv:2508.19652, 2025. [14] Haolin Liu, Dian Yu, Sidi Lu, Yujun Zhou, Rui Liu, Zhenwen Liang, Haitao Mi, Chen-Yu Wei, and Dong Yu. 
Save the good prefix: Precise error penalization via process-supervised rl to"},{"citing_arxiv_id":"2605.08816","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?","primary_cat":"cs.AI","submitted_at":"2026-05-09T09:10:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stronger VLM agents use mirror reflections for self-identification in controlled 3D tests, while weaker ones inspect but fail to extract or correctly attribute self-relevant information.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"reasoning [38]. In embodied settings, they can map visual observations and instructions into high-level plans or actions, but strong multimodal performance does not necessarily imply grounded visual reasoning: VLMs still suffer from hallucination, weak modality alignment, textual priors, and failures on spatially simple or perceptually grounded tasks [14, 16, 25, 32, 34, 40]. Recent work improves grounding through perception-language decomposition, self-rewarding, calibration, navigation history, and action-conditioned reasoning [12, 14, 32, 35, 40]. Our work shifts this focus from grounding external objects to grounding self-relevant visual information. 
Multimodal Benchmarks, Shortcuts, and Process-Level Evaluation."},{"citing_arxiv_id":"2605.06121","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-05-07T12:30:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks QFSD and AgriInsect.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20987","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks","primary_cat":"cs.AI","submitted_at":"2026-04-22T18:17:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18493","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data","primary_cat":"cs.LG","submitted_at":"2026-04-20T16:43:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and 
raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18320","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations","primary_cat":"cs.CV","submitted_at":"2026-04-20T14:20:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"Florence, and Andy Zeng. 2023. Code as policies: Language model programs for embodied control. In 2023 IEEE International conference on robotics and automation (ICRA). IEEE, 9493-9500. [23] Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, and Junxian He. 2024. Diving into self-evolving training for multimodal reasoning. arXiv preprint arXiv:2412.17451 (2024). [24] Wei Liu, Siya Qi, Yali Du, and Yulan He. 2026. Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain. arXiv preprint arXiv:2603.02218 (2026). [25] Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. 2025. 
Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning."},{"citing_arxiv_id":"2604.13602","ref_index":170,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges","primary_cat":"cs.LG","submitted_at":"2026-04-15T08:11:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Strict maximization of proxy rewards often prioritizes metric exploitation over genuine quality. This manifests in three dimensions. (1) Visual and Structural Degradation: Models may break visual fidelity, producing low-level pixel artifacts or high-level geometric distortion. Low-level issues include oversaturated colors, artificial textures, and high-frequency grid patterns [170, 171, 182, 209, 210]. High-level issues involve physical implausibility, such as asymmetrical facial features or illogical object duplication [181, 211, 212]. 
In 3D generation, the \"Janus problem\"-where an object displays multiple front faces-is a classic instance of evaluator-level exploitation: the generator reverse-engineers"},{"citing_arxiv_id":"2604.02467","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation","primary_cat":"cs.CV","submitted_at":"2026-04-02T18:58:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while improving framing and prompt adherence.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"fidelity and text-visual consistency by distilling human-like preferences into diffusion models [1,34,64]. Works like Control-A-Video [4] further demonstrate reward feedback learning for controllable video diffusion, using visual reward signals derived from rendered frames. Self-rewarding and AI-generated preference signals further reduce dependence on costly labels and help mitigate reward hacking [28,29,39]. Recent Vision-Language Models (VLMs) also advance video understanding with finer temporal reasoning and narrative coherence [38,49], and benchmark suites such as ShotBench and VEU-Bench [26,31] probe cinematographic perception including shot scale, camera movement, and editing style. 
However, unlike images or videos where pixel-level visual feedback is readily"},{"citing_arxiv_id":"2603.24935","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-03-26T01:56:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.01993","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection","primary_cat":"cs.CV","submitted_at":"2026-03-02T15:45:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"REFORM is a three-stage reasoning curriculum plus the ROM dataset that achieves state-of-the-art generalization on multimodal manipulation detection benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.11113","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VIDEOP2R: Video Understanding from Perception to Reasoning","primary_cat":"cs.CV","submitted_at":"2025-11-14T09:42:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.20912","ref_index":10,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning","primary_cat":"cs.AI","submitted_at":"2025-09-25T08:58:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeFacto trains multimodal models with counterfactual image variants and GRPO reinforcement learning to enforce that correct answers are supported by correct visual evidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.02547","ref_index":243,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","primary_cat":"cs.AI","submitted_at":"2025-09-02T17:46:26+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"their most powerful cognitive abilities-imagination-to produce sketches or diagrams that aid problem-solving. Inspired by this, researchers have begun equipping LVLMs with the ability to generate sketches or images interleaved with chain-of-thought (CoT) reasoning, enabling models to externalize intermediate representations and reason more effectively [243, 244, 245]. Visual Planning [243] proposes to use imagined image rollouts only as the CoT images thinking, using downstream task success as the reward signal. 
GoT-R1 [246] applies RL within the Generation-CoT framework, allowing models to autonomously discover semantic-spatial reasoning plans before producing the image. Similarly, T2I-R1 [247] explicitly decouples"}],"limit":50,"offset":0}