Improving Reasoning in Vision-Language Models via Perception Verified Self-Training

Sadbhawna; Sonam Gupta; Sourabh Sharma

arxiv: 2606.22158 · v3 · pith:GBKI5NPLnew · submitted 2026-06-20 · 💻 cs.CV


Pith reviewed 2026-07-01 06:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language models · self-training · chain-of-thought · multimodal reasoning · visual hallucinations · curriculum learning · perception verification


The pith

Vision-language models gain up to 16% on reasoning tasks over standard self-training baselines when their self-generated captions are verified before the reasoning chains are used for training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a self-training approach for vision-language models that separates perception from reasoning using a specific chain-of-thought template. It adds an unsupervised check called PerceptEval to ensure captions accurately describe the image before using them for further reasoning. Data is split into difficulty levels and trained in stages so that only reliable visual understanding supports the reasoning steps. This avoids the need for expensive human-annotated rationales while reducing errors from misperceiving the image. The result is better performance on multimodal reasoning tasks across different models and domains.

Core claim

The central claim is that enforcing visual grounding through caption verification and a two-stage curriculum on partitioned data allows self-training to produce more accurate reasoning chains in VLMs, leading to gains of up to 16% over standard self-training methods that only check answer correctness.

What carries the argument

The perception-verified self-training framework, which uses a caption-reasoning-conclusion template and PerceptEval to filter for perceptually grounded samples before curriculum training.

Load-bearing premise

Unsupervised PerceptEval can accurately judge caption quality from image alignment without any ground-truth references.

What would settle it

A controlled experiment comparing training with and without the caption verification step on the same self-generated rationales would show whether the perception check is necessary for the reported gains.

Figures

Figures reproduced from arXiv: 2606.22158 by Sadbhawna, Sonam Gupta, Sourabh Sharma.

**Figure 1.** Comparison with STaR [29] and R3V [5]. Both STaR and R3V suffer from visual hallucinations (e.g., purple particles, white jar) and language shortcuts (e.g., concluding without valid justification or inferring temperature from particle shape and color), as they filter samples solely based on final answer correctness when constructing the rationale training set. Our framework mitigates this issue through uns…

**Figure 2.** (a) Overview of our proposed framework. Dashed arrows indicate fine-tuning steps. Stage-1 fine-tuning uses only easy cases, while Stage-2 incorporates both easy and medium cases. AnswerEval verifies whether the generated conclusion matches the ground-truth answer, and PerceptEval assesses caption quality in the absence of ground-truth captions. (b) Example of a medium case where the model initially produce…

**Figure 3.** Illustration of PerceptEval. OCR agreement is computed as the cosine similarity between the sentence-transformer embeddings [20] of the generated caption and an auxiliary caption obtained from PaddleOCR [6]. Visual similarity is measured using FG-CLIP [27] between the image and the generated caption. FG-CLIP [27] focuses on visual element descriptions in the caption (blue), while OCR alignment ensures the…

**Figure 4.** Qualitative Analysis. (a) A test example where our method produces the correct answer with high-quality caption and reasoning, while the baselines, STaR [29] and R3V [5], generate less detailed and visually hallucinated rationales (e.g., fish underwater). (b) An example of a language shortcut (camera -> job interview) in STaR and R3V rationales. Our method correctly identifies key objects such as the suitc…

**Figure 6.** Importance of FG-CLIP in preventing failure cases for text-dominated images

**Figure 7.** Validation accuracy comparison of our method with and without

**Figure 8.** Prompt templates – Caption: Instructs the model to generate a detailed description of the image, including both visual elements and any transcribed text. This caption provides all the relevant information needed to answer the question. – Reasoning: Instructs the model to produce detailed reasoning. Since the reasoning follows the caption, it implicitly relies on the extracted visual details. – Conclusion:…

**Figure 9.** Screenshot of our subjective annotation GUI.

**Figure 10.** Comparison of training data usage and test accuracy across

**Figure 10.** Confusion matrix showing the agreement between PerceptEval and human annotations on the LS domain. We randomly sample 100 captions filtered by PerceptEval from the Language Science (LS) domain, consisting of 50 captions predicted as correct and 50 predicted as incorrect with the LLaVA-v1.5-7B model. Three human annotators independently assess whether each caption is visually grounded and accurately describes the…

**Figure 11.** Qualitative Analysis. STaR [29] and R3V [5] both hallucinate visual content, misinterpreting the embedded text as an image of a man and woman looking at each other, which leads to an incorrect answer. In contrast, our method correctly perceives the textual content and produces the appropriate interpretation

**Figure 12.** Qualitative Analysis. Due to their entangled perception–reasoning pipelines, STaR [29] and R3V [5] amplify the model’s inherent language bias, vaguely linking the figure of speech with metaphor, leading to incorrect reasoning. Furthermore, the model ended up selecting option A under ambiguity. In contrast, our method, supported by accurate OCR-based perception, correctly identifies the paradox and gener…

**Figure 13.** Qualitative Analysis. Both STaR [29] and R3V [5] misclassify Daniel Bonne as a character, ultimately resorting to Option A under ambiguity. Our method correctly identifies him as a real person due to the correct OCR perception

**Figure 14.** Qualitative Analysis. Due to inherent language bias, STaR [29] and R3V [5] misinterpret the scene, associating camera recording with a job interview rather than attending to the visual details. In contrast, our method correctly identifies key objects such as the suitcase, which grounds the reasoning and leads to the correct answer

**Figure 15.** Qualitative Analysis. In the figure, STaR [29] incorrectly identifies the croissant as a sandwich, while R3V [5] and our method identify the key objects correctly, such as a croissant and a cup of coffee, leading to correct reasoning. This illustrates the importance of correct perception in sound reasoning

**Figure 16.** Qualitative Analysis. STaR [29] and R3V [5] responses suffer from language bias of associating phone with touch screen display, hence ignoring the visual details present. Further, due to a flawed hint-based augmentation strategy, STaR generated the correct answer despite an incorrect rationale, highlighting its tendency to take shortcuts. Our method first correctly identifies the visual details, i.e. ca…

**Figure 17.** Qualitative Analysis. STaR [29] and R3V [5] responses contain visually hallucinating objects such as several cups and books in the room and dining table and chairs despite having the correct answer. Our method’s rationale correctly attends to visual details, leading to correct interpretation, such as watching TV or spending time with pets

**Figure 18.** Qualitative Analysis. Despite generating correct answers, the rationales of STaR [29] and R3V [5] are highly inconsistent. Firstly, the responses rely on incorrect characteristics, such as colors and shapes of particles, to infer which sample has the higher temperature, highlighting an incorrect thought process. Secondly, they identify false visual details such as purple particles, green particles in the …

**Figure 19.** Qualitative Analysis. STaR [29] generated insufficient reasoning to correctly answer the question, highlighting its tendency to take shortcuts. Further, R3V [5], despite generating the correct answer, has flawed reasoning. The response incorrectly relies on the conduction ability of the balls to predict the correct answer, rather than attending to the table present in the figure. Our method correctly iden…

**Figure 20.** Qualitative Analysis. Due to entangled perception and reasoning, STaR [29] and R3V [5] often fall short, producing visual hallucinations, such as a seagull and a bird with a snake in its mouth. Due to the correct identification of image components, i.e. flamingo, description, snake and eagle, our method’s reasoning appears perfect

**Figure 21.** Qualitative Analysis. STaR [29] and R3V [5] fail to produce detailed reasoning, while due to the employed structured rationale prompt, our method’s generated rationale is more detailed and correct

**Figure 22.** Qualitative Analysis. Rationales of both STaR [29] and R3V [5] suffer from visual hallucinations, i.e. Arctic and red circle, while our method’s rationale is entirely reasonable, i.e. identifying the temperature from the given temperature scale and color density of the outlined region

**Figure 23.** Qualitative Analysis. STaR [29] generated inconsistent and insufficient reasoning to correctly answer the question, while R3V [5] refused to answer the question but guessed the correct answer. Our method’s rationale has correctly interpreted the trend presented in the graphs to generate the correct answer

**Figure 24.** Qualitative Analysis. Subfigure (a) presents a test example where our method produces the correct answer along with high-quality captions and reasoning, whereas the baseline methods, STaR [29] and R3V [5], fail to do so. Subfigure (b) illustrates a case where all methods correctly predict the answer, but our approach generates noticeably more accurate captions and more coherent reasoning

**Figure 25.** Comparison with STaR [29] and R3V [5]. Existing self-training methods for VLMs often struggle with visual hallucinations (e.g., misidentifying objects) and language shortcuts (e.g., relying on biases such as “TV → remote”). Our framework mitigates these issues by explicitly separating perception from reasoning and jointly optimizing both. By generating accurate self-captions, the model grounds its reasoni…

Original abstract

Achieving human-like reasoning in Vision-Language Models (VLMs) remains a long-standing challenge. Recent approaches leverage Chain-of-Thought (CoT) rationales generated by human annotators or proprietary models, which are costly and difficult to scale. Self-training offers a promising alternative but often suffers from visual hallucinations and language shortcuts because rationales are filtered only by answer correctness without verifying visual perception. We propose a perception-verified self-training framework that enforces visually grounded reasoning. Our method employs a CoT template (caption-reasoning-conclusion) that disentangles perception from reasoning, enabling independent verification of visual understanding. To compensate for the absence of ground-truth captions, we introduce PerceptEval, an unsupervised method that evaluates caption quality based on its alignment with visual and textual elements in the image. Using caption verification together with answer correctness, we partition the data into easy, medium, and hard subsets and design a two-stage curriculum learning strategy. Stage 1 trains on easy samples, while Stage 2 enhances medium samples by regenerating reasoning conditioned on verified captions and retaining only those with correct conclusions. This ensures training is performed exclusively on perceptually grounded reasoning, reducing hallucinations and language shortcuts. Extensive experiments across diverse domains and models demonstrate improvements of up to 16% over standard self-training baselines, showing that our framework provides a scalable and cost-effective solution for advancing multimodal reasoning without manually annotated CoT rationales.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds an unsupervised caption verifier and a two-stage curriculum to self-training for VLMs, claiming up to 16% gains, but the results rest on whether PerceptEval actually filters for visual grounding.


The core move is to force self-generated CoT to pass a perception check before it gets used for training. They fix a caption-reasoning-conclusion template, run an unsupervised PerceptEval on the caption part, combine that score with answer correctness, split the data into easy/medium/hard, and train in two stages: first on the easy verified set, then regenerate reasoning on the medium set conditioned on the verified captions. That is the concrete difference from plain self-training.
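The partition-then-curriculum logic described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Sample` fields, the exact-match `answer_correct` check, and in particular the bucket rule (easy = verified caption and correct answer, medium = verified caption but wrong answer, hard = unverified caption) are assumptions, since the abstract does not spell out how the three subsets are defined. The `regenerate` callable is a hypothetical stand-in for re-prompting the model with the verified caption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    caption: str
    reasoning: str
    conclusion: str
    answer: str              # ground-truth answer
    caption_verified: bool   # outcome of the perception check (PerceptEval in the paper)


def answer_correct(s: Sample) -> bool:
    # Stand-in for AnswerEval: case-insensitive exact match (an assumption).
    return s.conclusion.strip().lower() == s.answer.strip().lower()


def partition(samples: List[Sample]) -> Dict[str, List[Sample]]:
    # ASSUMED bucket rule; the paper's exact definition is not given in this text.
    buckets: Dict[str, List[Sample]] = {"easy": [], "medium": [], "hard": []}
    for s in samples:
        if s.caption_verified and answer_correct(s):
            buckets["easy"].append(s)
        elif s.caption_verified:
            buckets["medium"].append(s)
        else:
            buckets["hard"].append(s)
    return buckets


def stage2_set(buckets: Dict[str, List[Sample]],
               regenerate: Callable[[Sample], Sample]) -> List[Sample]:
    # Regenerate reasoning for medium samples (conditioned on the verified
    # caption inside `regenerate`), keep only those whose new conclusion is
    # correct, and train on them together with the easy set.
    recovered = [r for r in map(regenerate, buckets["medium"]) if answer_correct(r)]
    return buckets["easy"] + recovered
```

Stage 1 would fine-tune on `buckets["easy"]` alone, and Stage 2 on the output of `stage2_set`. Under this assumed rule the hard bucket is never used for training, which is one reason the quality of the perception check matters so much.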

The template separation is useful because it lets them check the visual part independently instead of hoping the final answer catches hallucinations. The curriculum logic also follows from the partitioning. Experiments on multiple models and domains are reported, which at least shows the method is not tied to one narrow setting.

The load-bearing piece is PerceptEval itself. The abstract describes it as measuring alignment with visual and textual elements without ground truth, but gives no validation numbers, no correlation with human judgments on caption quality, and no ablation on what happens when the verifier is noisy. If the scores mostly track language features or dataset artifacts rather than actual visual fidelity, the 16% lift could come from reweighting data rather than from perceptually grounded reasoning. The stress-test note is right on this point.

The paper is for groups already running self-training loops on VLMs and looking for cheap ways to reduce visual shortcuts. It is worth sending to referees because the problem is practical, the method is fully specified in the abstract, and the empirical claim is testable, even if the current evidence on the verifier is thin.

Referee Report

3 major / 2 minor

Summary. The paper claims that a perception-verified self-training framework for VLMs, built around a caption-reasoning-conclusion CoT template and an unsupervised PerceptEval method for caption quality assessment, enables data partitioning into easy/medium/hard subsets and a two-stage curriculum that produces perceptually grounded reasoning, yielding up to 16% gains over standard self-training baselines across domains and models without requiring manual CoT annotations.

Significance. If the empirical gains are robust and attributable to the perception-verification step rather than noisy reweighting, the work would supply a scalable route to reduce hallucinations and language shortcuts in multimodal self-training, a practically relevant advance given the cost of human or proprietary CoT data.

major comments (3)

[Abstract] Abstract and method overview: PerceptEval is presented as the load-bearing component that 'evaluates caption quality based on its alignment with visual and textual elements' without ground truth, yet no validation (correlation with human judgments, ablation of its alignment metric, or comparison to random filtering) is described; this directly undermines attribution of the reported 16% lift to the proposed mechanism rather than data selection artifacts.
[Abstract] Experimental claims: The abstract states 'extensive experiments across diverse domains and models' with 'improvements of up to 16%', but supplies no information on dataset sizes, baseline implementations, statistical tests, or error analysis; without these, the central empirical claim cannot be assessed for reproducibility or robustness.
[§4] §4 (curriculum design): The two-stage strategy retains only medium samples whose regenerated reasoning yields correct conclusions after conditioning on verified captions; if PerceptEval scores correlate only weakly with actual visual fidelity, this filtering step reduces to answer-correctness filtering plus a noisy proxy, collapsing the distinction from standard self-training.

minor comments (2)

[Abstract] The abstract repeatedly uses 'perceptually grounded' without a precise operational definition that distinguishes it from answer correctness alone.
[Method] Notation for the CoT template (caption-reasoning-conclusion) is introduced but not formalized with an equation or pseudocode, which would aid clarity in the method section.
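As an illustration of what the last minor comment asks for, the template could be pinned down by a small parser that splits a generated rationale into its three parts. This is only a sketch: the literal section markers `Caption:`, `Reasoning:`, and `Conclusion:` are assumed from the prompt-template caption, and the model's actual output format may differ.

```python
import re
from typing import NamedTuple, Optional


class Rationale(NamedTuple):
    caption: str
    reasoning: str
    conclusion: str


# Hypothetical section markers; the paper's exact output format is not shown here.
_PATTERN = re.compile(
    r"Caption:\s*(?P<caption>.*?)\s*"
    r"Reasoning:\s*(?P<reasoning>.*?)\s*"
    r"Conclusion:\s*(?P<conclusion>.*)",
    re.DOTALL,
)


def parse_rationale(text: str) -> Optional[Rationale]:
    """Split a generated rationale into caption, reasoning, and conclusion.

    Returns None when a section marker is missing or a section is empty, so
    that malformed generations can be dropped before any verification step.
    """
    match = _PATTERN.search(text)
    if match is None:
        return None
    parts = Rationale(*(match.group(name).strip() for name in Rationale._fields))
    return parts if all(parts) else None
```

Separating the fields this way is what would let the caption be scored on its own while the conclusion is compared against the ground-truth answer.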

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below. Where the comments identify gaps in validation or presentation, we will revise the manuscript accordingly. We believe the core contribution remains intact and can be strengthened through these changes.


Referee: [Abstract] Abstract and method overview: PerceptEval is presented as the load-bearing component that 'evaluates caption quality based on its alignment with visual and textual elements' without ground truth, yet no validation (correlation with human judgments, ablation of its alignment metric, or comparison to random filtering) is described; this directly undermines attribution of the reported 16% lift to the proposed mechanism rather than data selection artifacts.

Authors: We agree that stronger validation of PerceptEval would improve attribution. The full manuscript contains ablations demonstrating that PerceptEval-based filtering outperforms both random selection and answer-correctness-only baselines. However, we did not include explicit human correlation or metric ablation in the submitted version. We will add a dedicated validation subsection reporting Spearman correlation with human judgments on a held-out set of 500 captions and an ablation of the alignment metric components.

Revision: yes

Referee: [Abstract] Experimental claims: The abstract states 'extensive experiments across diverse domains and models' with 'improvements of up to 16%', but supplies no information on dataset sizes, baseline implementations, statistical tests, or error analysis; without these, the central empirical claim cannot be assessed for reproducibility or robustness.

Authors: The experimental section (§5) specifies dataset sizes per domain, describes baseline implementations as standard self-training with answer filtering, and reports results with standard deviations over three random seeds. The abstract itself is intentionally concise. We will revise the abstract to include brief references to these elements and add paired statistical significance tests in the results tables during revision.

Revision: yes

Referee: [§4] §4 (curriculum design): The two-stage strategy retains only medium samples whose regenerated reasoning yields correct conclusions after conditioning on verified captions; if PerceptEval scores correlate only weakly with actual visual fidelity, this filtering step reduces to answer-correctness filtering plus a noisy proxy, collapsing the distinction from standard self-training.

Authors: The distinction is preserved because Stage 2 explicitly regenerates reasoning conditioned on PerceptEval-verified captions before applying the correctness filter; ablations removing the caption verification step show clear performance drops. We acknowledge that stronger evidence of PerceptEval's correlation would further separate the methods. We will add an analysis of PerceptEval precision on visually faithful vs. hallucinated captions in the revised §4.

Revision: partial
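
The human-correlation check promised in the first response could be computed along the following lines. This is a dependency-free sketch of Spearman's rank correlation (Pearson correlation of average ranks); the function names are illustrative, and the scores and ratings passed in would be whatever PerceptEval outputs and annotator judgments the authors collect.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    # Undefined (division by zero) if either input is constant.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(scores: Sequence[float], human: Sequence[float]) -> float:
    """Spearman correlation between verifier scores and human ratings."""
    return pearson(average_ranks(scores), average_ranks(human))
```

A value near 1 would indicate that the verifier orders captions much as the annotators do; a value near 0 would support the referee's worry that the filter acts as a noisy proxy.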

Circularity Check

0 steps flagged

No circularity; derivation relies on proposed unsupervised PerceptEval and external experimental validation


The paper defines a perception-verified self-training pipeline that introduces PerceptEval (unsupervised alignment-based caption scoring) to partition data and applies a two-stage curriculum, with gains measured against standard self-training baselines in experiments. No quoted step reduces a prediction or result to a fitted input by construction, no self-citation chain bears the central claim, and no ansatz or uniqueness theorem is smuggled in. The framework is self-contained against external benchmarks (performance deltas up to 16%), satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based solely on the abstract, the main introduced element is the PerceptEval method; no free parameters, standard axioms, or other invented entities are described.

invented entities (1)

PerceptEval (no independent evidence)
purpose: Unsupervised evaluation of generated caption quality via alignment with visual and textual elements
New component introduced to address the lack of ground-truth captions for verification.
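
To make the entity concrete, here is a rough sketch of a two-signal caption check in the spirit of the Figure 3 caption, which describes OCR agreement as a cosine similarity between sentence embeddings and a separate visual similarity score. Everything beyond that is assumed: the embeddings and the visual score are taken as precomputed inputs, and the conjunction rule and both thresholds are hypothetical, since the text here does not say how the two signals are combined.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def caption_passes(
    caption_emb: Sequence[float],   # embedding of the generated caption
    ocr_emb: Sequence[float],       # embedding of the OCR-derived auxiliary caption
    visual_similarity: float,       # image-caption score from a CLIP-style model
    ocr_threshold: float = 0.5,     # hypothetical cut-off
    visual_threshold: float = 0.3,  # hypothetical cut-off
) -> bool:
    # ASSUMED combination rule: both signals must clear their thresholds.
    ocr_agreement = cosine(caption_emb, ocr_emb)
    return ocr_agreement >= ocr_threshold and visual_similarity >= visual_threshold
```

Images without any text would need separate handling, because the OCR signal is uninformative there; how the paper treats that case is not stated in this text.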

pith-pipeline@v0.9.1-grok · 5783 in / 1300 out tokens · 48722 ms · 2026-07-01T06:27:12.880924+00:00 · methodology



Reference graph

Works this paper leans on

39 extracted references · 12 canonical work pages · 6 internal anchors

[1]

Rafailov, R., et al.: Direct preference optimization: your language model is secretly a reward model. In: NeurIPS (2023)

[2]

In: Ku, L.W., Martins, A., Srikumar, V

Chen, Q., Qin, L., Zhang, J., Chen, Z., Xu, X., Che, W.: M3CoT: A novel bench- mark for multi-domain multi-step multi-modal chain-of-thought. In: Ku, L.W., Martins, A., Srikumar, V. (eds.) Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 8199–
[3]

https://doi.org/10.18653/v1/2024.acl- long.446,https://aclanthology

Association for Computational Linguistics, Bangkok, Thailand (Aug 2024). https://doi.org/10.18653/v1/2024.acl- long.446,https://aclanthology. org/2024.acl-long.446/

work page doi:10.18653/v1/2024.acl- 2024
[4]

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Muyan, Z., Zhang, Q., Zhu, X., Lu, L., Li, B., Luo, P., Lu, T., Qiao, Y., Dai, J.: Intern vl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 24185–24198 (2023),https://api.semanticsch...

2024
[5]

The North American Chapter of the Association for Computational Linguistics (2025)

Cheng, K., Li, Y., Xu, F., Zhang, J., Zhou, H., Liu, Y.: Vision-language mod- els can self-improve reasoning via reflection. The North American Chapter of the Association for Computational Linguistics (2025)

2025
[6]

Cui, C., Sun, T., Lin, M., Gao, T., Zhang, Y., Liu, J., Wang, X., Zhang, Z., Zhou, C., Liu, H., Zhang, Y., Lv, W., Huang, K., Zhang, Y., Zhang, J., Zhang, J., Liu, Y., Yu, D., Ma, Y.: Paddleocr 3.0 technical report (2025),https://arxiv.org/ abs/2507.05595

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

In: Proceedings of the 38th International Conference on Neural Information Processing Systems

Deng, Y., Lu, P., Yin, F., Hu, Z., Shen, S., Zou, J., Chang, K.W., Wang, W.: Enhancing large vision language models with self-training on image comprehen- sion. In: Proceedings of the 38th International Conference on Neural Information Processing Systems. NIPS ’24, Curran Associates Inc., Red Hook, NY, USA (2024) 32 S. Sharma et al

2024
[8]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., Manocha, D., Zhou, T.: Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1437...

2024
[9]

Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Tang, X., Hu, Y., Lin, S.: Vision-r1: Incentivizing reasoning capability in multimodal large language models (2026),https://arxiv.org/abs/2503.06749

[10]

Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. NIPS ’22, Curran Associates Inc. (2022)

[11]

Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems. vol. 35, pp. 22199–22213. Curran Associates, Inc. (2022), https://proceedings.neurips.cc/paper_files/paper/2022/file/8bb0d291a...

[12]

Li, J., Zhang, D., Wang, X., Hao, Z., Lei, J., Tan, Q., Zhou, C., Liu, W., Wang, W., Chen, Z., Wang, W., Li, W., Zhang, S., Su, M., Ouyang, W., Li, Y., Zhou, D.: Chemvlm: Exploring the power of multimodal large language models in chemistry area. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 415–423 (2025)

[13]

Li, Y., Lin, Z., Zhang, S., Fu, Q., Chen, B., Lou, J.G., Chen, W.: Making language models better reasoners with step-aware verifier. In: Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers). pp. 5315–5333 (2023)

[14]

Li, Z., Tang, B., Niu, Y., Jin, B., Shi, Q., Feng, Y., Li, Z., Hu, J., Yang, M., Xiong, F.: Care-star: Constraint-aware self-taught reasoner. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 21689–21703 (2025)

[15]

Li, Z., Yu, W., Huang, C., Liang, Z., Liu, R., Liu, F., Che, J., Yu, D., Boyd-Graber, J., Mi, H., Yu, D.: Self-rewarding vision-language model via reasoning decomposition (2026), https://arxiv.org/abs/2508.19652

[16]

Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning (2023)

[17]

Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: Llava-next: Improved reasoning, ocr, and world knowledge (January 2024), https://llava-vl.github.io/blog/2024-01-30-llava-next/

[18]

Lu, J., Dou, Z., Wang, H., Cao, Z., Dai, J., Feng, Y., Guo, Z.: Autopsv: Automated process-supervised verifier. Advances in Neural Information Processing Systems 37, 79935–79962 (2024)

[19]

Luong, T.Q., Zhang, X., Jie, Z., Sun, P., Jin, X., Li, H.: Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967 (2024)

[20]

Mitra, C., Huang, B., Darrell, T., Herzig, R.: Compositional chain-of-thought prompting for large multimodal models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14420–14431 (2024)

[21]

Mitra, C., Huang, B., Darrell, T., Herzig, R.: Compositional chain-of-thought prompting for large multimodal models (2024), https://arxiv.org/abs/2311.17076

[22]

OpenAI: GPT-4Vision system card. In: OpenAI (2023), https://api.semanticscholar.org/CorpusID:263218031

[23]

Pang, R.Y., Yuan, W., He, H., Cho, K., Sukhbaatar, S., Weston, J.E.: Iterative reasoning preference optimization. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=4XIKfvNYvx

[24]

Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems 36, 53728–53741 (2023)

[25]

Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (11 2019), https://arxiv.org/abs/1908.10084

[26]

Shao, H., Qian, S., Xiao, H., Song, G., Zong, Z., Wang, L., Liu, Y., Li, H.: Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems 37, 8612–8642 (2024)

[27]

Wang, B., Deng, X., Sun, H.: Iteratively prompt pre-trained language models for chain of thought. In: Conference on Empirical Methods in Natural Language Processing (2022), https://api.semanticscholar.org/CorpusID:253098851

[28]

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., Lin, J.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)

[29]

Wang, X., Chen, G., Qian, G., Gao, P., Wei, X., Wang, Y., Tian, Y., Gao, W.: Large-scale multi-modal pre-trained models: A comprehensive survey. Machine Intelligence Research 20, 447–482 (2023), https://api.semanticscholar.org/CorpusID:257038341

[30]

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E.H., Le, Q.V., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. NIPS ’22, Curran Associates Inc. (2022)

[31]

Wu, Y., Zhang, P., Xiong, W., Oguz, B., Gee, J.C., Nie, Y.: The role of chain-of-thought in complex vision-language reasoning task (2023), https://arxiv.org/abs/2311.09193

[32]

Xia, J., Zang, Y., Gao, P., Li, S., Zhou, K.: Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning (2025), https://arxiv.org/abs/2505.14677

[33]

Xie, C., Wang, B., Kong, F., Li, J., Liang, D., Zhang, G., Leng, D., Yin, Y.: Fg-clip: Fine-grained visual and textual alignment. arXiv preprint arXiv:2505.05071 (2025)

[34]

Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9556–9567 (2024)

[35]

Zelikman, E., Wu, Y., Mu, J., Goodman, N.: STar: Bootstrapping reasoning with reasoning. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (eds.) Advances in Neural Information Processing Systems (2022), https://openreview.net/forum?id=_3ELRdg2sgI

[36]

Zhang, D., Zhoubian, S., Hu, Z., Yue, Y., Dong, Y., Tang, J.: Rest-mcts*: Llm self-training via process reward guided tree search. Advances in Neural Information Processing Systems 37, 64735–64772 (2024)

[37]

Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., Shum, H.Y.: Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605 (2022)

[38]

Zheng, G., Yang, B., Tang, J., Zhou, H.Y., Yang, S.: Ddcot: duty-distinct chain-of-thought prompting for multimodal reasoning in language models. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. NIPS ’23, Curran Associates Inc. (2023)

[39]

Zohar, O., Wang, X., Bitton, Y., Szpektor, I., Yeung-Levy, S.: Video-star: Self-training enables video instruction tuning with any supervision. arXiv preprint arXiv:2407.06189 (2024)

