pith. machine review for the scientific record.

arxiv: 2604.20705 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learning · reinforcement learning · multimodal large language models · visual puzzles · verifiable rewards · post-training · image understanding

The pith

Reformulating visual self-supervised tasks as verifiable puzzles supplies automatic rewards for reinforcement learning post-training of multimodal language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SSL-R1, a framework that converts standard visual self-supervised learning tasks into puzzles whose solutions can be checked automatically from the image itself. These puzzles then supply the reward signal for reinforcement learning that refines multimodal large language models after their initial training. The goal is to strengthen the models' native visual understanding and reasoning without depending on human labels or separate language-based supervision. If successful, the method would allow post-training to scale using only abundant unlabeled images while reducing the dominance of language-centric priors in current pipelines.
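As a minimal sketch of the idea (illustrative only, not the paper's implementation): a puzzle such as rotation prediction carries its own ground truth by construction, so the reward can be checked directly against the image transformation that produced it. The angle set, prompt wording, and function names below are assumptions.

import random
from PIL import Image

ANGLES = [0, 90, 180, 270]  # assumed discrete angle set for illustration

def make_rotation_puzzle(image: Image.Image) -> dict:
    # The ground-truth answer is fixed by construction: it is simply the
    # angle we rotate by, so no human label or external model is needed.
    angle = random.choice(ANGLES)
    rotated = image.rotate(-angle, expand=True)
    prompt = ("This image was rotated by 0, 90, 180, or 270 degrees. "
              "Reply with the angle only.")
    return {"image": rotated, "prompt": prompt, "answer": str(angle)}

def verifiable_reward(model_answer: str, puzzle: dict) -> float:
    # Binary reward checked against the puzzle's own ground truth; this is
    # the signal an RL post-training loop would optimize.
    return 1.0 if model_answer.strip() == puzzle["answer"] else 0.0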

Core claim

SSL-R1 reformulates widely used visual self-supervised tasks into a collection of verifiable visual puzzles. These puzzles generate rewards directly from image data for RL post-training of MLLMs, requiring neither human annotations nor external model supervision. Models trained under this regime show substantial gains on multimodal understanding and reasoning benchmarks.

What carries the argument

Reformulation of visual SSL tasks into verifiable puzzles that yield image-derived rewards for reinforcement learning.

If this is right

  • MLLMs exhibit measurable gains on multimodal understanding and reasoning benchmarks after training on the visual puzzles.
  • RL post-training becomes feasible at larger scales because rewards no longer require human or external model supervision.
  • Vision-centric self-supervised signals can be used to counteract language-centric biases in MLLM training.
  • The framework supplies concrete experience for designing additional self-supervised verifiable rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could be extended to video or 3D data by turning temporal or geometric SSL tasks into similarly verifiable puzzles.
  • Combining these visual rewards with existing language-based RLVR signals might produce hybrid training regimes that balance modalities more evenly.
  • Models refined this way may display improved transfer to downstream visual tasks that were never used as training puzzles.

Load-bearing premise

Rewards obtained by solving these visual puzzles will strengthen the model's general visual understanding and reasoning instead of merely teaching it to solve the specific puzzles.

What would settle it

Train an MLLM with SSL-R1 and measure its accuracy on held-out multimodal benchmarks such as visual question answering or reasoning tasks; no improvement or a drop relative to a standard baseline would falsify the central claim.
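A schematic of that falsification test, under assumptions (the benchmark names and the evaluate helper are placeholders, not the paper's tooling):

def evaluate(model, benchmark: str) -> float:
    # Placeholder: returns accuracy of `model` on a held-out benchmark.
    raise NotImplementedError

def central_claim_survives(baseline, ssl_r1_model,
                           benchmarks=("VQA-style", "visual-reasoning")) -> bool:
    # The claim is falsified if the post-trained model fails to beat the
    # baseline on benchmarks never used as training puzzles.
    deltas = [evaluate(ssl_r1_model, b) - evaluate(baseline, b)
              for b in benchmarks]
    return all(d > 0 for d in deltas)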

Figures

Figures reproduced from arXiv: 2604.20705 by Alessio Tonioni, Bernt Schiele, Federico Tombari, Jiahao Xie, Nathalie Rauschmayr.

Figure 1
Figure 1: (a) Existing reinforcement learning with verifiable rewards (RLVR) for post-training MLLMs are supervised, requiring a large volume of high-quality language-centric multimodal data with human annotations, which is very expensive and unsustainable. (b) We propose SSL-R1, a generic self-supervised RLVR-based post-training framework that derives intrinsically verifiable rewards from input images, requiring ne…
Figure 2
Figure 2: Overview of our SSL-R1 tasks. We design five verifiable self-supervised tasks for RL post-training, ranging from the image level to the pixel level. Rotation Prediction: an image is rotated by a certain angle, and the model is tasked with predicting the angle. Visual Similarity: two augmented views are cropped from an image, with several additional views from other images, and the task is to select the mos…
Figure 4
Figure 4: The training dynamics of SSL-R1. Left: Single-task rewards. Right: Multi-task rewards. All curves are exponentially smoothed for visualization.
Figure 5
Figure 5: Prompt templates for five self-supervised tasks. A format prompt (bottom) is appended at the end of each task prompt to enable the reasoning process.
Figure 6
Figure 6: Examples of the Rotation Prediction task. The ground-truth answer for each example is provided at the bottom.
Figure 7
Figure 7: Examples of the Visual Similarity task. The ground-truth answer for each example is provided at the bottom.
Figure 8
Figure 8: Examples of the Region Inpainting task. The ground-truth answer for each example is provided at the bottom.
Figure 9
Figure 9: Examples of the Patch Ordering task. The ground-truth answer for each example is provided at the bottom.
Figure 10
Figure 10: Examples of the Geometric Correspondence task. The ground-truth answer for each example is provided at the bottom.
Figure 11
Figure 11: Qualitative examples on three types of vision-centric multimodal benchmarks. The wrong answers are marked in red while the correct answers are marked in green.
read the original abstract

Reinforcement learning (RL) with verifiable rewards (RLVR) has demonstrated the great potential of enhancing the reasoning abilities in multimodal large language models (MLLMs). However, the reliance on language-centric priors and expensive manual annotations prevents MLLMs' intrinsic visual understanding and scalable reward designs. In this work, we introduce SSL-R1, a generic self-supervised RL framework that derives verifiable rewards directly from images. To this end, we revisit self-supervised learning (SSL) in visual domains and reformulate widely-used SSL tasks into a set of verifiable visual puzzles for RL post-training, requiring neither human nor external model supervision. Training MLLMs on these tasks substantially improves their performance on multimodal understanding and reasoning benchmarks, highlighting the potential of leveraging vision-centric self-supervised tasks for MLLM post-training. We think this work will provide useful experience in devising effective self-supervised verifiable rewards to enable RL at scale. Project page: https://github.com/Jiahao000/SSL-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces SSL-R1, a self-supervised RL post-training framework for MLLMs. It reformulates standard visual SSL tasks (e.g., rotation prediction, jigsaw) into verifiable image-based puzzles that generate rewards directly from intrinsic image properties, without human annotations or external models. The central claim is that training MLLMs via RL on these tasks substantially improves performance on multimodal understanding and reasoning benchmarks.

Significance. If the gains prove generalizable rather than puzzle-specific, the work would meaningfully advance scalable RLVR for MLLMs by shifting reward design to vision-centric self-supervision. The approach correctly identifies the annotation bottleneck in prior RLVR methods and proposes a concrete alternative using existing SSL primitives.

major comments (1)
  1. [Experiments] Experiments section: The central claim that RL post-training on the reformulated puzzles produces transferable visual reasoning (rather than puzzle-format optimization) is load-bearing. The manuscript must include an ablation comparing full SSL-R1 RL against supervised fine-tuning on identical puzzle data; without it, benchmark gains could be explained by task exposure alone. This concern is therefore a correctness risk that requires a concrete control experiment.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'widely-used SSL tasks' should explicitly name the tasks (rotation, jigsaw, inpainting, etc.) and state how they are reformulated into question-answer format, for immediate clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work and for the constructive major comment. We agree that the requested ablation is necessary to strengthen the central claim and will incorporate it in the revised manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: The central claim that RL post-training on the reformulated puzzles produces transferable visual reasoning (rather than puzzle-format optimization) is load-bearing. The manuscript must include an ablation comparing full SSL-R1 RL against supervised fine-tuning on identical puzzle data; without it, benchmark gains could be explained by task exposure alone. This concern is therefore a correctness risk that requires a concrete control experiment.

    Authors: We agree that this control experiment is essential to rule out the possibility that gains arise merely from exposure to the puzzle formats rather than from the RL optimization itself. In the revised manuscript we will add a direct comparison of SSL-R1 (RL post-training with verifiable rewards) against supervised fine-tuning on the identical set of puzzle instances, using the same verifiable ground-truth answers as supervision targets. This ablation will be reported alongside the existing results in the Experiments section, with details on training hyperparameters and evaluation to ensure fair comparison. We believe the new results will further substantiate that the RL stage yields transferable visual reasoning improvements beyond supervised task exposure. revision: yes
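A sketch of the two matched arms of that control experiment (names and fields are illustrative, not the authors' training code):

from dataclasses import dataclass

@dataclass
class Arm:
    name: str
    objective: str   # "sft": cross-entropy on the puzzle's ground-truth answer
                     # "rl":  policy optimization on the verifiable reward
    puzzle_set: str  # identical puzzle instances and data budget in both arms

ablation = [
    Arm("SFT-on-puzzles", objective="sft", puzzle_set="ssl_puzzles_v1"),
    Arm("SSL-R1-RL", objective="rl", puzzle_set="ssl_puzzles_v1"),
]
# Any gap on held-out multimodal benchmarks between the two arms, with data
# and compute matched, isolates what the RL stage adds beyond puzzle exposure.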

Circularity Check

0 steps flagged

No circularity in the claimed derivation chain

full rationale

The paper introduces SSL-R1 by reformulating standard visual SSL tasks (e.g., rotation, jigsaw) into verifiable image-based puzzles whose rewards are computed directly from intrinsic image properties without human or external model labels. This reformulation and the subsequent RL post-training step constitute an independent methodological contribution; benchmark gains are reported as empirical outcomes rather than quantities derived by construction from fitted parameters or prior self-citations. No load-bearing uniqueness theorems, ansatzes, or self-referential definitions appear in the provided text, and the central premise does not reduce to renaming or tautological prediction of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities are identifiable.

pith-pipeline@v0.9.0 · 5486 in / 1207 out tokens · 34305 ms · 2026-05-10T01:20:48.499352+00:00 · methodology

discussion (0)

