{"work":{"id":"38998646-34ee-4605-b661-ab356f16d6e5","openalex_id":null,"doi":null,"arxiv_id":"2503.06749","raw_key":null,"title":"Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models","authors":null,"authors_text":"Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao","year":2025,"venue":"cs.CV","abstract":"DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a high-quality multimodal CoT dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering to obtain a 200K multimodal CoT dataset, Vision-R1-cold dataset. It serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after cold start, we propose Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with the hard formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on a 10K multimodal math dataset. Comprehensive experiments show our model achieves an average improvement of $\\sim$6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves a 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI O1. 
Scaling up the amount of multimodal math data in the RL training, Vision-R1-32B and Vision-R1-72B achieve 76.4% and 78.2% MathVista benchmark scores, respectively. The datasets and code will be released in: https://github.com/Osilly/Vision-R1 .","external_url":"https://arxiv.org/abs/2503.06749","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T04:30:20.074581+00:00","pith_arxiv_id":"2503.06749","created_at":"2026-05-09T05:55:31.149472+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models","render_title":"Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models"},"hub":{"state":{"work_id":"38998646-34ee-4605-b661-ab356f16d6e5","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":110,"external_cited_by_count":null,"distinct_field_count":9,"first_pith_cited_at":"2025-02-24T18:50:52+00:00","last_pith_cited_at":"2026-05-22T06:47:19+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-27T10:05:50.802080+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":28},{"context_role":"baseline","n":5},{"context_role":"dataset","n":1},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":28},{"context_polarity":"baseline","n":5},{"context_polarity":"use_dataset","n":1},{"context_polarity":"use_method","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models","claims":[{"claim_text":"DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely 
through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a h","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"85 20.40 39.20 40.26 29.97 Gemini-3.1-Flash-Lite 55.56 23.38 43.04 40.40 40.80 49.35 40.68 Gemini-3.1-Pro 83.33 48.05 44.30 68.00 62.40 79.22 63.82 Open-Source QA Models UniVG-R1 [37] 30.56 19.74 31.65 22.13 36.00 22.08 26.22 SophiaVL-R1 [6] 38.89 23.68 22.78 27.46 34.40 37.66 29.67 VL-Rethinker [45] 33.33 23.68 31.65 27.05 33.60 29.87 29.20 Vision-R1 [46] 13.89 22.37 17.72 27.87 12.80 19.48 21.19 Open-Source General Models OneThinker-8B [39] 36.11 21.05 29.11 24.59 40.00 23.38 28.26 InternVL-3.","claim_type":"baseline","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"chart descriptions. For prompt templates used, see Section B.1. 3.2. QA Pairs with CoT Reasoning In addition to chart image, code, tabular data, and natu- ral language descriptions, we also generate question-answer (QA) pairs with long Chain-of-Thought (CoT) reasoning as part of the ChartNet dataset. This data generation process is built on the Vision-R1 framework [18]. Using pixtral-large- instruct-2411, we generate a complex multi-stage reason- ing question for each image in the ChartNet datas","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"improve generation, and then visual reasoning datasets. 6.DeepEyes[56]: is end-to-end trained with RL to think with images and interleaves the visual grounding step inside the whole reasoning process. 
7.PixelReasoner[37]: adopts pixel-space reasoning (e.g. zoom and crop) and a two-phase training: fine-tuning on synthesized data, then curiosity-driven RL. 8.Vision-R1[14]: cold-starts via a synthetic dataset, then applies GRPO with a hard formatting reward and a pro- gressive thinking suppression ","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"tion and general multimodal understanding benchmarks. Code is avail- able athttps://github.com/EthenCheng/HyLaR. Keywords:Multimodal Large Language Models·Visual Latent Rea- soning·Policy Optimization 1 Introduction TheintegrationofexplicitChain-of-Thought(CoT)reasoninghasfundamentally transformed how multimodal large language models (MLLMs) approach intri- cate vision-language tasks [1,5,13,14,46,49]. However, most prevailing MLLMs suffer from a critical architectural bottleneck:early semantic ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"We evaluate GazeVLM against multiple VQA benchmarks spanning complex chart and logical reasoning (MathVista [20], ChartQA [22]), general reasoning (MMBench [19], MMStar [10], CV-Bench [31]), and high-resolution benchmarks (HRBench-4k, HRBench-8k [34]). We compare our results against a wide range of models, including closed-source models, vanilla open-source VLMs, SFT-trained reasoning models (Vision-R1-LlamaV-CI [14], SPARC [2]), and state-of-the-art RL-trained agents (Vision-R1 [14], Pixel-Reas","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"OverallOverall Counting IQ-Test JigSawRelative Reflect Spatial RelationOverall V∗D.A.V∗R.P. 
Proprietary Model GPT-4o [10] - 58.33 51.1 51.7 30.0 58.0 38.8 76.9 62.8 - - Open-Source Model Qwen2.5-VL-7B 7B 66.7 54.5 65.8 27.3 52.7 41.0 88.1 78.5 81.7 73.7 + vanilla SFT 7B 69.5 53.1 60.8 26.7 45.3 33.6 88.8 79.1 82.6 73.7 PAPO [27] 7B 54.3 54.8 66.7 29.3 52.0 39.6 88.8 36.1 25.2 52.6 Vision-R1 [9] 7B 46.7 42.8 51.7 26.7 27.3 44.8 66.4 70.2 70.4 69.7 PixelReasoner [23] 7B 67.0 54.5 66.7 25.3 52.7 42","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (28 contexts).","role_counts":[{"n":28,"context_role":"background"},{"n":5,"context_role":"baseline"},{"n":1,"context_role":"dataset"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-20T19:42:24.861575+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"3c1b7160-2b96-4fbb-b960-72105052c629","orcid":null,"display_name":"Wenxuan Huang"},{"id":"3de77749-736a-4461-9e3a-8b544d285b3e","orcid":null,"display_name":"Bohan Jia"},{"id":"19977cf4-5cd0-4c50-854e-6b79366b40c2","orcid":null,"display_name":"Zijie Zhai"},{"id":"d7298796-3a81-4385-be2e-3b54868b5801","orcid":null,"display_name":"Shaosheng Cao"},{"id":"e71f3878-f7ae-4e9c-a840-ed9c067ea7eb","orcid":null,"display_name":"Zheyu Ye"},{"id":"388ac9f9-022e-4450-9426-8d7ad39075d8","orcid":null,"display_name":"Fei Zhao"}]},"error":null,"updated_at":"2026-05-20T19:42:17.113621+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T14:21:40.950070+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language 
Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":33},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":30},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":27},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":16},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":16},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":14},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":13},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":12},{"title":"MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning","work_id":"eda3a54e-ebd6-40bd-af17-b567ea4c5d62","shared_citers":12},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":12},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":11},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":11},{"title":"MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts","work_id":"e22c3789-9e71-4242-b6ea-3e60e06e2b66","shared_citers":11},{"title":"R1-Onevision: Ad- vancing generalized multimodal reasoning through cross-modal formalization","work_id":"bd2bf4d0-20bf-49b8-8dac-b54a8019be6c","shared_citers":11},{"title":"InternVL3.5: Advancing 
Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":10},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":10},{"title":"Visual-RFT: Visual Reinforcement Fine-Tuning","work_id":"872f09b5-998d-4a66-9a2f-f7ec2407cd62","shared_citers":10},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":9},{"title":"DeepEyes: Incentivizing \"Thinking with Images\" via Reinforcement Learning","work_id":"5f6cf57b-2407-4127-b39c-d8a61494e474","shared_citers":9},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":9},{"title":"VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model","work_id":"d36889cb-edb6-448f-9a50-36df8b1623e5","shared_citers":9},{"title":"VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning","work_id":"b13ff087-9614-48cc-8991-ad75b6543bbc","shared_citers":9},{"title":"arXiv preprint arXiv:2503.12937 , year=","work_id":"e1614961-16d2-43bc-908a-8c57da5b151c","shared_citers":8},{"title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs","work_id":"bff96ab1-bd6a-4585-be23-74fdb51969c7","shared_citers":7}],"time_series":[{"n":4,"year":2025},{"n":46,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T14:21:47.409298+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical 
Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T14:21:53.906912+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models","claims":[{"claim_text":"DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a h","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"85 20.40 39.20 40.26 29.97 Gemini-3.1-Flash-Lite 55.56 23.38 43.04 40.40 40.80 49.35 40.68 Gemini-3.1-Pro 83.33 48.05 44.30 68.00 62.40 79.22 63.82 Open-Source QA Models UniVG-R1 [37] 30.56 19.74 31.65 22.13 36.00 22.08 26.22 SophiaVL-R1 [6] 38.89 23.68 22.78 27.46 34.40 37.66 29.67 VL-Rethinker [45] 33.33 23.68 31.65 27.05 33.60 29.87 29.20 Vision-R1 [46] 13.89 22.37 17.72 27.87 12.80 19.48 21.19 Open-Source General Models OneThinker-8B [39] 36.11 21.05 29.11 24.59 40.00 23.38 28.26 InternVL-3.","claim_type":"baseline","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"chart descriptions. For prompt templates used, see Section B.1. 3.2. 
QA Pairs with CoT Reasoning In addition to chart image, code, tabular data, and natu- ral language descriptions, we also generate question-answer (QA) pairs with long Chain-of-Thought (CoT) reasoning as part of the ChartNet dataset. This data generation process is built on the Vision-R1 framework [18]. Using pixtral-large- instruct-2411, we generate a complex multi-stage reason- ing question for each image in the ChartNet datas","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"improve generation, and then visual reasoning datasets. 6.DeepEyes[56]: is end-to-end trained with RL to think with images and interleaves the visual grounding step inside the whole reasoning process. 7.PixelReasoner[37]: adopts pixel-space reasoning (e.g. zoom and crop) and a two-phase training: fine-tuning on synthesized data, then curiosity-driven RL. 8.Vision-R1[14]: cold-starts via a synthetic dataset, then applies GRPO with a hard formatting reward and a pro- gressive thinking suppression ","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"tion and general multimodal understanding benchmarks. Code is avail- able athttps://github.com/EthenCheng/HyLaR. Keywords:Multimodal Large Language Models·Visual Latent Rea- soning·Policy Optimization 1 Introduction TheintegrationofexplicitChain-of-Thought(CoT)reasoninghasfundamentally transformed how multimodal large language models (MLLMs) approach intri- cate vision-language tasks [1,5,13,14,46,49]. However, most prevailing MLLMs suffer from a critical architectural bottleneck:early semantic ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"We evaluate GazeVLM against multiple VQA benchmarks spanning complex chart and logical reasoning (MathVista [20], ChartQA [22]), general reasoning (MMBench [19], MMStar [10], CV-Bench [31]), and high-resolution benchmarks (HRBench-4k, HRBench-8k [34]). 
We compare our results against a wide range of models, including closed-source models, vanilla open-source VLMs, SFT-trained reasoning models (Vision-R1-LlamaV-CI [14], SPARC [2]), and state-of-the-art RL-trained agents (Vision-R1 [14], Pixel-Reas","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"OverallOverall Counting IQ-Test JigSawRelative Reflect Spatial RelationOverall V∗D.A.V∗R.P. Proprietary Model GPT-4o [10] - 58.33 51.1 51.7 30.0 58.0 38.8 76.9 62.8 - - Open-Source Model Qwen2.5-VL-7B 7B 66.7 54.5 65.8 27.3 52.7 41.0 88.1 78.5 81.7 73.7 + vanilla SFT 7B 69.5 53.1 60.8 26.7 45.3 33.6 88.8 79.1 82.6 73.7 PAPO [27] 7B 54.3 54.8 66.7 29.3 52.0 39.6 88.8 36.1 25.2 52.6 Vision-R1 [9] 7B 46.7 42.8 51.7 26.7 27.3 44.8 66.4 70.2 70.4 69.7 PixelReasoner [23] 7B 67.0 54.5 66.7 25.3 52.7 42","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (28 contexts).","role_counts":[{"n":28,"context_role":"background"},{"n":5,"context_role":"baseline"},{"n":1,"context_role":"dataset"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-20T19:42:28.272522+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models","claims":[{"claim_text":"DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. 
However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a h","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T14:21:53.910924+00:00"}},"summary":{"title":"Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models","claims":[{"claim_text":"DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. 
Specifically, we first construct a h","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":33},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":30},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":27},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":16},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":16},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":14},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":13},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":12},{"title":"MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning","work_id":"eda3a54e-ebd6-40bd-af17-b567ea4c5d62","shared_citers":12},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":12},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":11},{"title":"GPT-4o System 
Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":11},{"title":"MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts","work_id":"e22c3789-9e71-4242-b6ea-3e60e06e2b66","shared_citers":11},{"title":"R1-Onevision: Ad- vancing generalized multimodal reasoning through cross-modal formalization","work_id":"bd2bf4d0-20bf-49b8-8dac-b54a8019be6c","shared_citers":11},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":10},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":10},{"title":"Visual-RFT: Visual Reinforcement Fine-Tuning","work_id":"872f09b5-998d-4a66-9a2f-f7ec2407cd62","shared_citers":10},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":9},{"title":"DeepEyes: Incentivizing \"Thinking with Images\" via Reinforcement Learning","work_id":"5f6cf57b-2407-4127-b39c-d8a61494e474","shared_citers":9},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":9},{"title":"VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model","work_id":"d36889cb-edb6-448f-9a50-36df8b1623e5","shared_citers":9},{"title":"VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning","work_id":"b13ff087-9614-48cc-8991-ad75b6543bbc","shared_citers":9},{"title":"arXiv preprint arXiv:2503.12937 , year=","work_id":"e1614961-16d2-43bc-908a-8c57da5b151c","shared_citers":8},{"title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs","work_id":"bff96ab1-bd6a-4585-be23-74fdb51969c7","shared_citers":7}],"time_series":[{"n":4,"year":2025},{"n":46,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"3de77749-736a-4461-9e3a-8b544d285b3e","orcid":null,"display_name":"Bohan 
Jia","source":"manual","import_confidence":0.72},{"id":"388ac9f9-022e-4450-9426-8d7ad39075d8","orcid":null,"display_name":"Fei Zhao","source":"manual","import_confidence":0.72},{"id":"d7298796-3a81-4385-be2e-3b54868b5801","orcid":null,"display_name":"Shaosheng Cao","source":"manual","import_confidence":0.72},{"id":"3c1b7160-2b96-4fbb-b960-72105052c629","orcid":null,"display_name":"Wenxuan Huang","source":"manual","import_confidence":0.72},{"id":"e71f3878-f7ae-4e9c-a840-ed9c067ea7eb","orcid":null,"display_name":"Zheyu Ye","source":"manual","import_confidence":0.72},{"id":"19977cf4-5cd0-4c50-854e-6b79366b40c2","orcid":null,"display_name":"Zijie Zhai","source":"manual","import_confidence":0.72}]}}