Boosting Visual Instruction Tuning with Self-Supervised Guidance
Pith reviewed 2026-05-10 16:22 UTC · model grok-4.3
The pith
Reformulating self-supervised tasks as instructions improves vision-centric performance in multimodal models
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reformulating self-supervised pretext tasks as image-instruction-response triplets that cannot be solved without visual evidence, and injecting a small fraction of such instructions during visual instruction tuning, yields consistent gains on vision-centric evaluations across multiple models, training regimes, and benchmarks.
What carries the argument
Reformulation of classical self-supervised pretext tasks into image-instruction-response triplets that force reliance on visual input rather than language priors.
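The reformulation itself is mechanical; as a minimal sketch of the idea (not the paper's released pipeline, and with hypothetical function and field names), a rotation-prediction pretext task can be recast as an instruction triplet whose answer depends only on how the pixels were transformed:

```python
import random

def rotate_90_ccw(image):
    """Rotate a row-major 2D grid 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]

def make_rotation_triplet(image, rng):
    """Recast rotation prediction as an image-instruction-response triplet.

    The response is determined by the pixel transformation alone, so a
    language prior over the instruction text cannot recover it.
    """
    k = rng.choice([0, 1, 2, 3])  # number of 90-degree rotations applied
    rotated = image
    for _ in range(k):
        rotated = rotate_90_ccw(rotated)
    return {
        "image": rotated,
        "instruction": "By how many degrees was this image rotated "
                       "counter-clockwise? Answer 0, 90, 180, or 270.",
        "response": f"{90 * k}",
    }

rng = random.Random(0)
img = [[1, 2], [3, 4]]  # toy 2x2 stand-in for an image
triplet = make_rotation_triplet(img, rng)
```

Color matching and cross-view correspondence would follow the same pattern: a deterministic visual transformation generates the label, so no human annotation is needed.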
If this is right
- Vision-centric benchmark scores rise without any model architecture or training procedure changes.
- The gains appear across different multimodal models and instruction-tuning regimes.
- Only a small fraction of the overall training data needs to consist of the visually grounded instructions.
- Adjusting the distribution of instruction data is sufficient to improve visual reasoning.
Where Pith is reading between the lines
- The same reformulation trick could be tested on other input modalities where models lean on priors.
- This points to data composition as a higher-leverage knob than scaling model size for visual tasks.
- Extending the set of pretext tasks to include additional visual properties would test the generality of the approach.
Load-bearing premise
The reformulated self-supervised tasks cannot be solved from language priors alone and therefore force the model to rely on visual evidence.
What would settle it
If models achieve the same performance gains when the self-supervised instructions are replaced by non-visual text-only equivalents, the claim that visual grounding drives the improvement would be falsified.
Original abstract
Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code available at: https://github.com/sirkosophia/V-GIFT
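The 3-10% injection the abstract describes amounts to a change in the training-data mixture only. A hedged sketch of such a mix (illustrative names, not the released V-GIFT code) makes the arithmetic explicit: to have SSL triplets form a fraction f of the final set while keeping the base data intact, one adds n = |base| * f / (1 - f) SSL examples:

```python
import random

def mix_instruction_data(base_data, ssl_data, ssl_fraction, seed=0):
    """Build a tuning set in which ssl_fraction of examples are SSL triplets.

    The base instruction data is kept whole; SSL triplets are added on top
    so that grounded examples make up ssl_fraction of the final mixture.
    """
    if not 0.0 < ssl_fraction < 1.0:
        raise ValueError("ssl_fraction must be in (0, 1)")
    # Solve n_ssl / (len(base_data) + n_ssl) == ssl_fraction for n_ssl.
    n_ssl = round(len(base_data) * ssl_fraction / (1.0 - ssl_fraction))
    rng = random.Random(seed)
    sampled = [rng.choice(ssl_data) for _ in range(n_ssl)]
    mixed = base_data + sampled
    rng.shuffle(mixed)
    return mixed

base = [{"task": "vqa", "id": i} for i in range(970)]
ssl = [{"task": "rotation"}, {"task": "color"}, {"task": "correspondence"}]
mixed = mix_instruction_data(base, ssl, ssl_fraction=0.03)  # 30 of 1000
```

No architecture or optimizer change is involved, which is exactly why the review below presses on whether the gains come from the SSL content or merely from added data.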
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that augmenting visual instruction tuning with 3-10% reformulated self-supervised pretext tasks (rotation prediction, color matching, cross-view correspondence) expressed as image-instruction-response triplets improves MLLM performance on vision-centric benchmarks. These tasks are asserted to supply supervision that cannot be solved using language priors alone, thereby compelling greater utilization of visual features during tuning. The method requires no annotations, architectural changes, or extra stages, and yields consistent gains across models, regimes, and benchmarks. Code is released at https://github.com/sirkosophia/V-GIFT.
Significance. If the reported gains are robust and specifically attributable to compelled visual grounding rather than data volume or diversity effects, the work provides a lightweight, annotation-free lever for improving visual reasoning in MLLMs. This could influence data curation practices for instruction tuning. The open-source code is a clear strength that supports reproducibility and extension.
Major comments (3)
- [Abstract, §3] Abstract and §3: The load-bearing assertion that the reformulated SSL tasks 'cannot be solved without relying on visual evidence' is stated but not tested. No ablation evaluates whether a language-only model or a vision-ablated input can solve the tasks above chance (e.g., via common object-color associations for color matching or orientation statistics for rotation). Without this, gains cannot be confidently attributed to visual utilization rather than generic instruction data effects.
- [§4, Table 2] §4 and Table 2: Performance tables show improvements on vision-centric evaluations, but lack controls that inject equivalent volumes of non-SSL instructions (random or language-prior-heavy) to isolate the contribution of the visual-grounding mechanism. The 3-10% fraction is presented as key, yet no scaling or volume-matched baseline is reported.
- [§4.3] §4.3: While multiple models and benchmarks are evaluated, the manuscript provides no statistical tests, run-to-run variance, or confidence intervals. This weakens the claim of 'consistent' improvements, especially given the small data fraction and potential sensitivity to training hyperparameters.
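The first major comment's missing ablation can be framed as a chance-level check: if a predictor that never sees the image cannot beat 1/|classes| accuracy, the task is plausibly unsolvable from language priors alone. A toy simulation of that check, with an assumed stand-in for a text-only baseline rather than any real model:

```python
import random

def chance_level_check(predict, labels, rng):
    """Estimate accuracy of a predictor that never receives the image."""
    hits = sum(predict(rng) == y for y in labels)
    return hits / len(labels)

rng = random.Random(0)
classes = ["0", "90", "180", "270"]  # rotation-prediction answers; chance = 0.25
labels = [rng.choice(classes) for _ in range(1000)]

# Stand-in for a text-only baseline: it guesses without visual input.
text_only_guess = lambda r: r.choice(classes)
acc = chance_level_check(text_only_guess, labels, rng)
```

A real version of this test would substitute an actual language-only LLM (or a vision-ablated MLLM) for the guessing lambda and compare its accuracy against the 0.25 chance floor.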
Minor comments (2)
- [§2] §2: Related work on SSL in vision-language models is cited, but the discussion of how the proposed reformulation differs from prior uses of pretext tasks in MLLM training could be expanded for clarity.
- [Figure 1] Figure 1: The diagram illustrating the data augmentation pipeline is helpful, but the caption should explicitly note the exact percentage of SSL samples used in the illustrated example.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the contributions and strengthen the evidence for our claims. We address each major point below and commit to revisions that directly respond to the concerns while preserving the core findings.
Point-by-point responses
-
Referee: [Abstract, §3] The assertion that reformulated SSL tasks 'cannot be solved without relying on visual evidence' is stated but not tested. No ablation with language-only model or vision-ablated input to check if solvable above chance via priors.
Authors: We agree this explicit test would provide stronger attribution. The tasks were selected precisely because classical SSL literature shows they depend on visual properties (e.g., rotation requires image orientation; cross-view correspondence requires spatial alignment not deducible from text). In the revised manuscript we will add a controlled ablation: (i) a text-only LLM baseline on the same instruction triplets and (ii) a vision-ablated MLLM variant, demonstrating near-chance performance and thereby confirming the visual-grounding requirement. revision: yes
-
Referee: [§4, Table 2] Performance tables lack controls injecting equivalent volumes of non-SSL instructions (random or language-prior-heavy) to isolate visual-grounding mechanism; no volume-matched or scaling baseline for the 3-10% fraction.
Authors: We acknowledge that a direct volume-matched control would better isolate the mechanism. Our current setup keeps the base instruction data fixed and adds only the SSL fraction, so gains are measured atop identical data volume. In revision we will add a control experiment replacing the SSL triplets with an equal number of randomly sampled or language-prior-heavy instructions drawn from existing VQA-style data, showing that these do not produce comparable gains on vision-centric benchmarks. revision: yes
-
Referee: [§4.3] No statistical tests, run-to-run variance, or confidence intervals, weakening the 'consistent' claim given small data fraction and hyperparameter sensitivity.
Authors: We recognize the importance of statistical reporting. Experiments used fixed hyperparameters across models for fairness and showed gains on five distinct MLLMs and multiple benchmarks. Due to compute limits we did not run full multi-seed sweeps for every configuration. In the revised version we will report results from at least three independent runs for the primary settings, include standard deviations, and add a brief discussion of variance. revision: partial
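The volume-matched control the authors commit to in their second response has a simple shape: the treatment arm adds n SSL triplets, while the control arm adds n instructions resampled from the base pool, so both arms see identical data volume and any residual gain isolates the SSL content itself. A sketch under that assumption (names are illustrative):

```python
import random

def build_conditions(base_data, ssl_data, n_inject, seed=0):
    """Create treatment and volume-matched control tuning sets.

    Treatment adds n_inject SSL triplets; the control adds n_inject
    instructions resampled from the base pool, so both conditions have
    the same size and differ only in the injected content.
    """
    rng = random.Random(seed)
    treatment = base_data + [rng.choice(ssl_data) for _ in range(n_inject)]
    control = base_data + [rng.choice(base_data) for _ in range(n_inject)]
    return treatment, control

base = [{"src": "vqa", "id": i} for i in range(100)]
ssl = [{"src": "ssl", "task": t} for t in ("rotation", "color", "xview")]
treatment, control = build_conditions(base, ssl, n_inject=10)
```

If vision-centric scores rise under the treatment mix but not the control mix, the visual-grounding mechanism, rather than raw data volume, is doing the work.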
Circularity Check
No significant circularity; empirical method with external benchmarks
Full rationale
The paper presents an empirical data-augmentation technique: reformulating pretext tasks (rotation, color matching, cross-view correspondence) as instruction triplets and mixing 3-10% of them into visual instruction tuning. Performance gains are measured on held-out vision-centric benchmarks across multiple models and regimes. No equations, no fitted parameters renamed as predictions, and no load-bearing self-citations appear. The assertion that the tasks 'cannot be solved without visual evidence' is an unproven modeling assumption rather than a derivation that reduces to its own inputs; the reported improvements are externally falsifiable and do not rely on internal self-consistency loops. This is a standard non-circular empirical claim.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Self-supervised visual tasks reformulated as instructions cannot be solved without visual evidence.