pith. machine review for the scientific record.

arxiv: 2604.20705 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learning · reinforcement learning · multimodal large language models · visual puzzles · verifiable rewards · post-training · image understanding

The pith

Reformulating visual self-supervised tasks as verifiable puzzles supplies automatic rewards for reinforcement learning post-training of multimodal language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SSL-R1, a framework that converts standard visual self-supervised learning tasks into puzzles whose solutions can be checked automatically from the image itself. These puzzles then supply the reward signal for reinforcement learning that refines multimodal large language models after their initial training. The goal is to strengthen the models' native visual understanding and reasoning without depending on human labels or separate language-based supervision. If successful, the method would allow post-training to scale using only abundant unlabeled images while reducing the dominance of language-centric priors in current pipelines.
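As a minimal sketch of the idea (illustrative only, not the paper's implementation): a puzzle such as rotation prediction carries its own ground truth by construction, so the reward can be checked directly against the image transformation that produced it. The angle set, prompt wording, and function names below are assumptions.

import random
from PIL import Image

ANGLES = [0, 90, 180, 270]  # assumed discrete angle set for illustration

def make_rotation_puzzle(image: Image.Image) -> dict:
    # The ground-truth answer is fixed by construction: it is simply the
    # angle we rotate by, so no human label or external model is needed.
    angle = random.choice(ANGLES)
    rotated = image.rotate(-angle, expand=True)
    prompt = ("This image was rotated by 0, 90, 180, or 270 degrees. "
              "Reply with the angle only.")
    return {"image": rotated, "prompt": prompt, "answer": str(angle)}

def verifiable_reward(model_answer: str, puzzle: dict) -> float:
    # Binary reward checked against the puzzle's own ground truth; this is
    # the signal an RL post-training loop would optimize.
    return 1.0 if model_answer.strip() == puzzle["answer"] else 0.0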

Core claim

SSL-R1 reformulates widely used visual self-supervised tasks into a collection of verifiable visual puzzles. These puzzles generate rewards directly from image data for RL post-training of MLLMs, requiring neither human annotations nor external model supervision. Models trained under this regime show substantial gains on multimodal understanding and reasoning benchmarks.

What carries the argument

Reformulation of visual SSL tasks into verifiable puzzles that yield image-derived rewards for reinforcement learning.

If this is right

  • MLLMs exhibit measurable gains on multimodal understanding and reasoning benchmarks after training on the visual puzzles.
  • RL post-training becomes feasible at larger scales because rewards no longer require human or external model supervision.
  • Vision-centric self-supervised signals can be used to counteract language-centric biases in MLLM training.
  • The framework supplies concrete experience for designing additional self-supervised verifiable rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could be extended to video or 3D data by turning temporal or geometric SSL tasks into similarly verifiable puzzles.
  • Combining these visual rewards with existing language-based RLVR signals might produce hybrid training regimes that balance modalities more evenly.
  • Models refined this way may display improved transfer to downstream visual tasks that were never used as training puzzles.

Load-bearing premise

Rewards obtained by solving these visual puzzles will strengthen the model's general visual understanding and reasoning instead of merely teaching it to solve the specific puzzles.

What would settle it

Train an MLLM with SSL-R1 and measure its accuracy on held-out multimodal benchmarks such as visual question answering or reasoning tasks; no improvement or a drop relative to a standard baseline would falsify the central claim.
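A schematic of that falsification test, under assumptions (the benchmark names and the evaluate helper are placeholders, not the paper's tooling):

def evaluate(model, benchmark: str) -> float:
    # Placeholder: returns accuracy of `model` on a held-out benchmark.
    raise NotImplementedError

def central_claim_survives(baseline, ssl_r1_model,
                           benchmarks=("VQA-style", "visual-reasoning")) -> bool:
    # The claim is falsified if the post-trained model fails to beat the
    # baseline on benchmarks never used as training puzzles.
    deltas = [evaluate(ssl_r1_model, b) - evaluate(baseline, b)
              for b in benchmarks]
    return all(d > 0 for d in deltas)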

Figures

Figures reproduced from arXiv: 2604.20705 by Alessio Tonioni, Bernt Schiele, Federico Tombari, Jiahao Xie, Nathalie Rauschmayr.

Figure 1
Figure 1: (a) Existing reinforcement learning with verifiable rewards (RLVR) for post-training MLLMs are supervised, requiring a large volume of high-quality language-centric multimodal data with human annotations, which is very expensive and unsustainable. (b) We propose SSL-R1, a generic self-supervised RLVR-based post-training framework that derives intrinsically verifiable rewards from input images, requiring ne…
Figure 2
Figure 2: Overview of our SSL-R1 tasks. We design five verifiable self-supervised tasks for RL post-training, ranging from the image level to the pixel level. Rotation Prediction: an image is rotated by a certain angle, and the model is tasked with predicting the angle. Visual Similarity: two augmented views are cropped from an image, with several additional views from other images, and the task is to select the mos…
Figure 4
Figure 4: The training dynamics of SSL-R1. Left: Single-task rewards. Right: Multi-task rewards. All curves are exponentially smoothed for visualization.
Figure 5
Figure 5: Prompt templates for five self-supervised tasks. A format prompt (bottom) is appended at the end of each task prompt to enable the reasoning process.
Figure 6
Figure 6: Examples of the Rotation Prediction task. The ground-truth answer for each example is provided at the bottom.
Figure 7
Figure 7: Examples of the Visual Similarity task. The ground-truth answer for each example is provided at the bottom.
Figure 8
Figure 8: Examples of the Region Inpainting task. The ground-truth answer for each example is provided at the bottom.
Figure 9
Figure 9: Examples of the Patch Ordering task. The ground-truth answer for each example is provided at the bottom.
Figure 10
Figure 10: Examples of the Geometric Correspondence task. The ground-truth answer for each example is provided at the bottom.
Figure 11
Figure 11: Qualitative examples on three types of vision-centric multimodal benchmarks. The wrong answers are marked in red while the correct answers are marked in green.
read the original abstract

Reinforcement learning (RL) with verifiable rewards (RLVR) has demonstrated the great potential of enhancing the reasoning abilities in multimodal large language models (MLLMs). However, the reliance on language-centric priors and expensive manual annotations prevents MLLMs' intrinsic visual understanding and scalable reward designs. In this work, we introduce SSL-R1, a generic self-supervised RL framework that derives verifiable rewards directly from images. To this end, we revisit self-supervised learning (SSL) in visual domains and reformulate widely-used SSL tasks into a set of verifiable visual puzzles for RL post-training, requiring neither human nor external model supervision. Training MLLMs on these tasks substantially improves their performance on multimodal understanding and reasoning benchmarks, highlighting the potential of leveraging vision-centric self-supervised tasks for MLLM post-training. We think this work will provide useful experience in devising effective self-supervised verifiable rewards to enable RL at scale. Project page: https://github.com/Jiahao000/SSL-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces SSL-R1, a self-supervised RL post-training framework for MLLMs. It reformulates standard visual SSL tasks (e.g., rotation prediction, jigsaw) into verifiable image-based puzzles that generate rewards directly from intrinsic image properties, without human annotations or external models. The central claim is that training MLLMs via RL on these tasks substantially improves performance on multimodal understanding and reasoning benchmarks.

Significance. If the gains prove generalizable rather than puzzle-specific, the work would meaningfully advance scalable RLVR for MLLMs by shifting reward design to vision-centric self-supervision. The approach correctly identifies the annotation bottleneck in prior RLVR methods and proposes a concrete alternative using existing SSL primitives.

major comments (1)
  1. [Experiments] Experiments section: The central claim that RL post-training on the reformulated puzzles produces transferable visual reasoning (rather than puzzle-format optimization) is load-bearing. The manuscript must include an ablation comparing full SSL-R1 RL against supervised fine-tuning on identical puzzle data; without it, benchmark gains could be explained by task exposure alone. This concern is therefore a correctness risk that requires a concrete control experiment.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'widely-used SSL tasks' should explicitly name the tasks (rotation, jigsaw, inpainting, etc.) and state how they are reformulated into question-answer format, for immediate clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work and for the constructive major comment. We agree that the requested ablation is necessary to strengthen the central claim and will incorporate it in the revised manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: The central claim that RL post-training on the reformulated puzzles produces transferable visual reasoning (rather than puzzle-format optimization) is load-bearing. The manuscript must include an ablation comparing full SSL-R1 RL against supervised fine-tuning on identical puzzle data; without it, benchmark gains could be explained by task exposure alone. This concern is therefore a correctness risk that requires a concrete control experiment.

    Authors: We agree that this control experiment is essential to rule out the possibility that gains arise merely from exposure to the puzzle formats rather than from the RL optimization itself. In the revised manuscript we will add a direct comparison of SSL-R1 (RL post-training with verifiable rewards) against supervised fine-tuning on the identical set of puzzle instances, using the same verifiable ground-truth answers as supervision targets. This ablation will be reported alongside the existing results in the Experiments section, with details on training hyperparameters and evaluation to ensure fair comparison. We believe the new results will further substantiate that the RL stage yields transferable visual reasoning improvements beyond supervised task exposure. revision: yes
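A sketch of the two matched arms of that control experiment (names and fields are illustrative, not the authors' training code):

from dataclasses import dataclass

@dataclass
class Arm:
    name: str
    objective: str   # "sft": cross-entropy on the puzzle's ground-truth answer
                     # "rl":  policy optimization on the verifiable reward
    puzzle_set: str  # identical puzzle instances and data budget in both arms

ablation = [
    Arm("SFT-on-puzzles", objective="sft", puzzle_set="ssl_puzzles_v1"),
    Arm("SSL-R1-RL", objective="rl", puzzle_set="ssl_puzzles_v1"),
]
# Any gap on held-out multimodal benchmarks between the two arms, with data
# and compute matched, isolates what the RL stage adds beyond puzzle exposure.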

Circularity Check

0 steps flagged

No circularity in the claimed derivation chain

full rationale

The paper introduces SSL-R1 by reformulating standard visual SSL tasks (e.g., rotation, jigsaw) into verifiable image-based puzzles whose rewards are computed directly from intrinsic image properties without human or external model labels. This reformulation and the subsequent RL post-training step constitute an independent methodological contribution; benchmark gains are reported as empirical outcomes rather than quantities derived by construction from fitted parameters or prior self-citations. No load-bearing uniqueness theorems, ansatzes, or self-referential definitions appear in the provided text, and the central premise does not reduce to renaming or tautological prediction of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities are identifiable.

pith-pipeline@v0.9.0 · 5486 in / 1207 out tokens · 34305 ms · 2026-05-10T01:20:48.499352+00:00 · methodology

discussion (0)

