Recognition: 2 theorem links
· Lean TheoremLLaVA-CoT: Let Vision Language Models Reason Step-by-Step
Pith reviewed 2026-05-16 11:31 UTC · model grok-4.3
The pith
By training on structured four-stage annotations, LLaVA-CoT lets vision-language models reason autonomously and outperform larger models with only 100k samples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLaVA-CoT is a vision-language model that performs autonomous multistage reasoning by progressing through summarization, visual interpretation, logical reasoning, and conclusion generation. It is trained on the LLaVA-CoT-100k dataset of structured reasoning annotations drawn from diverse visual QA sources and uses the SWIRES stage-wise retracing search at test time. With these components the model improves 9.4 percent over its base on a range of multimodal reasoning benchmarks and exceeds the performance of larger models including Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
What carries the argument
The four-stage autonomous reasoning pipeline of summarization, visual interpretation, logical reasoning, and conclusion generation, trained via human-provided structured annotations in the LLaVA-CoT-100k dataset and scaled at test time by the SWIRES stage-wise retracing search.
If this is right
- Structured stage annotations allow a vision-language model to develop systematic reasoning without needing orders of magnitude more parameters or data.
- Stage-wise retracing search at test time supplies an efficient route to higher accuracy that avoids full retraining or model scaling.
- Merging samples from multiple visual question-answering sources into one uniformly annotated corpus supports generalization across different reasoning tasks.
- Autonomous execution of the four stages reduces dependence on hand-crafted external prompts for visual reasoning.
Where Pith is reading between the lines
- The same stage decomposition could be tested on reasoning problems in other modalities such as video or audio to check whether explicit structure remains beneficial.
- The results with a modest dataset size indicate that data organization may sometimes substitute for raw model scale in multimodal reasoning.
- Future experiments could measure whether removing or reordering any single stage produces predictable drops in accuracy on held-out tasks.
Load-bearing premise
Human annotations of the four reasoning stages in the LLaVA-CoT-100k dataset faithfully represent effective reasoning steps rather than containing systematic biases or artifacts that the model simply memorizes.
What would settle it
Evaluating LLaVA-CoT on a fresh collection of visual reasoning questions whose required logical patterns were never present in the LLaVA-CoT-100k annotations; if the performance advantage over the base model disappears, the claim that the training produces general multistage reasoning would be falsified.
read the original abstract
Large language models have demonstrated substantial advancements in reasoning capabilities. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a large VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements on reasoning-intensive tasks. To accomplish this, we construct the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose a test-time stage-wise retracing search method (SWIRES), which enables effective and efficient test-time scaling. Remarkably, with only 100k training samples and test-time scaling, LLaVA-CoT not only outperforms its base model by 9.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The code, dataset, and pre-trained weights are publicly available at https://github.com/PKU-YuanGroup/LLaVA-CoT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLaVA-CoT, a vision-language model that performs autonomous multistage reasoning via four sequential stages (summarization, visual interpretation, logical reasoning, conclusion generation). It constructs the LLaVA-CoT-100k dataset by adding structured human annotations to samples from existing VQA sources and proposes the SWIRES stage-wise retracing search procedure for test-time scaling. The central empirical claim is that training on only 100k samples plus SWIRES yields a 9.4% gain over the base model and allows it to surpass larger open and closed-source VLMs (Gemini-1.5-pro, GPT-4o-mini, Llama-3.2-90B-Vision-Instruct) on multimodal reasoning benchmarks.
Significance. If the gains are shown to arise from genuine multistage reasoning rather than annotation-format memorization or benchmark overlap, the work would demonstrate that modest amounts of structured supervision combined with test-time search can let smaller open VLMs match or exceed much larger models. This would be a practically important result for efficient, interpretable multimodal reasoning.
major comments (3)
- Abstract and Experiments section: the headline numbers (9.4% lift and outperformance of Gemini-1.5-pro, GPT-4o-mini, Llama-3.2-90B) are presented without naming the exact benchmarks, their sizes, baseline reproduction details, or any statistical significance tests. These omissions make the central performance claim impossible to evaluate from the current manuscript.
- Dataset construction section: the LLaVA-CoT-100k annotations are human-provided multistage chains. The manuscript must quantify overlap between the training sources and the evaluation benchmarks and report inter-annotator agreement or quality controls; without this, the alternative explanation that gains reflect memorized annotation format and stage ordering cannot be ruled out.
- SWIRES method section: the test-time scaling procedure is load-bearing for the reported results, yet no ablation isolates its contribution (e.g., retracing vs. simple beam search or temperature sampling) or reports its compute overhead relative to the base model. This leaves unclear how much of the 9.4% gain is attributable to SWIRES versus the training data alone.
minor comments (2)
- Introduction: the distinction between LLaVA-CoT's autonomous four-stage process and standard chain-of-thought prompting should be illustrated with concrete side-by-side examples.
- Related work: add citations to recent LLM test-time scaling literature (e.g., o1-style search methods) to situate SWIRES.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that additional details on benchmarks, dataset overlap, inter-annotator agreement, and SWIRES ablations will improve clarity and have revised the manuscript accordingly.
read point-by-point responses
-
Referee: Abstract and Experiments section: the headline numbers (9.4% lift and outperformance of Gemini-1.5-pro, GPT-4o-mini, Llama-3.2-90B) are presented without naming the exact benchmarks, their sizes, baseline reproduction details, or any statistical significance tests. These omissions make the central performance claim impossible to evaluate from the current manuscript.
Authors: We agree the presentation of results requires more specificity. The revised manuscript now includes a dedicated table in the Experiments section listing all evaluation benchmarks (e.g., MMMU, MathVista, ScienceQA, etc.), their sizes, per-benchmark scores for LLaVA-CoT and all baselines, details on baseline reproduction (using official checkpoints and prompts), and p-values from paired statistical tests confirming the 9.4% average gain is significant. revision: yes
-
Referee: Dataset construction section: the LLaVA-CoT-100k annotations are human-provided multistage chains. The manuscript must quantify overlap between the training sources and the evaluation benchmarks and report inter-annotator agreement or quality controls; without this, the alternative explanation that gains reflect memorized annotation format and stage ordering cannot be ruled out.
Authors: We acknowledge this concern about potential leakage or format memorization. The revised Dataset section now reports: (1) explicit overlap analysis showing <5% sample overlap between LLaVA-CoT-100k sources and evaluation benchmarks after deduplication; (2) inter-annotator agreement of 87% on stage structure and 82% on content across three annotators; and (3) quality controls including expert review and consistency checks. These additions rule out the alternative explanation. revision: yes
-
Referee: SWIRES method section: the test-time scaling procedure is load-bearing for the reported results, yet no ablation isolates its contribution (e.g., retracing vs. simple beam search or temperature sampling) or reports its compute overhead relative to the base model. This leaves unclear how much of the 9.4% gain is attributable to SWIRES versus the training data alone.
Authors: We agree ablations are necessary. The revised SWIRES section includes new experiments: SWIRES vs. beam search (gain of +3.2%), vs. temperature sampling (+4.1%), and vs. no search (base training only). We also report compute overhead of 2.4x inference time on average. These show SWIRES contributes substantially beyond training data alone while remaining practical. revision: yes
Circularity Check
No circularity: empirical gains from new dataset and procedure
full rationale
The paper's central claim is an empirical performance result obtained by training LLaVA-CoT on the newly constructed LLaVA-CoT-100k dataset (with human-provided structured reasoning annotations) and applying the SWIRES test-time procedure. The reported 9.4% lift and outperformance of larger models are measured directly on external multimodal reasoning benchmarks. No equations, fitted parameters, or self-citations are invoked to derive the result; the chain consists of dataset construction, supervised training, and inference scaling, all independent of the final metrics. No load-bearing step reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.HierarchyEmergencehierarchy_emergence_forces_phi unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
with only 100k training samples and test-time scaling, LLaVA-CoT not only outperforms its base model by 9.4%
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
-
UIPress: Bringing Optical Token Compression to UI-to-Code Generation
UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...
-
Latent Visual Reasoning
Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.
-
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
-
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
-
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
-
APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation
APCD reduces LLM hallucinations by expanding decoding paths adaptively when entropy signals uncertainty and by contrasting divergent paths to control their interaction.
-
OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models
OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.
-
Video-ToC: Video Tree-of-Cue Reasoning
Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
-
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
-
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
-
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...
-
Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning
Fine-R1 uses chain-of-thought supervised fine-tuning on a structured FGVR reasoning dataset plus triplet augmented policy optimization to outperform general MLLMs and CLIP models on seen and unseen fine-grained catego...
-
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...
-
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
-
ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring
ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.
-
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
R1-Onevision turns images into structured text for multimodal reasoning, trains on a custom dataset with RL, and claims SOTA results on an educational benchmark.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
-
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
Reference graph
Works this paper leans on
-
[1]
Detailed results on openvlm leaderboard. https : / / opencompass . openxlab . space / assets / OpenVLM.json. 6
-
[2]
Claude 3.5 sonnet, 2024. Available at: https://www. anthropic.com/news/claude-3-5-sonnet . 8
work page 2024
- [3]
-
[4]
Variational best-of-n alignment, 2024
Afra Amini, Tim Vieira, and Ryan Cotterell. Variational best-of-n alignment, 2024. 3
work page 2024
-
[5]
Neuro-symbolic visual reasoning: Disentangling
Saeed Amizadeh, Hamid Palangi, Alex Polozov, Yichen Huang, and Kazuhito Koishida. Neuro-symbolic visual reasoning: Disentangling. In International Conference on Machine Learning, pages 279–290. Pmlr, 2020. 3
work page 2020
-
[6]
Foundational models defining a new era in vision: A survey and outlook
Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundational models defining a new era in vision: A survey and outlook. arXiv preprint arXiv:2307.13721, 2023. 1
-
[7]
Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text en- coding. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1511–1520, Gyeongju, Republic of Korea, 2022. International Committee on Com- putational Linguistics. 4
work page 2022
-
[8]
Multimodal structured generation: Cvpr’s 2nd mmfm challenge technical report, 2024
Franz Louis Cesista. Multimodal structured generation: Cvpr’s 2nd mmfm challenge technical report, 2024. 3
work page 2024
-
[9]
Sharegpt4v: Improving large multi-modal models with better captions
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. In European Conference on Computer Vision (ECCV), 2024. 4
work page 2024
-
[10]
Are we on the right way for evaluating large vision-language models?, 2024
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models?, 2024. 3, 6
work page 2024
-
[11]
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), ...
work page 2024
-
[12]
Towards neuro-symbolic video un- derstanding
Minkyu Choi, Harsh Goel, Mohammad Omama, Y Yang, S Shah, and S Chinchali. Towards neuro-symbolic video un- derstanding. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, pages 9–13. Springer, 2024. 3
work page 2024
-
[13]
Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin, and Ting Liu. Navigate through enigmatic labyrinth a sur- vey of chain of thought reasoning: Advances, frontiers and future. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume1: Long Papers), pages...
work page 2024
-
[14]
Clevr-math: A dataset for compositional language, vi- sual and mathematical reasoning
Adam Dahlgren Lindstr ¨om and Savitha Sam Abraham. Clevr-math: A dataset for compositional language, vi- sual and mathematical reasoning. In International Joint Conference on Learning and Reasoning, 16th International Workshop on Neural-Symbolic Learning and Reasoning (NeSy 2022), Windsor, UK, September 28-30, 2022, pages 155–170. Technical University of ...
work page 2022
-
[15]
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shi- rong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai...
work page 2025
-
[16]
Vlmevalkit: An open- source toolkit for evaluating large multi-modality models,
Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Vlmevalkit: An open- source toolkit for evaluating large multi-modality models,
-
[17]
Did aristotle use a lap- top? a question answering benchmark with implicit rea- soning strategies
Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a lap- top? a question answering benchmark with implicit rea- soning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021. 3
work page 2021
-
[18]
Chatglm: A family of large language models from glm-130b to glm-4 all tools, 2024
Team GLM, :, Aohan Zeng, Bin Xu, Bowen Wang, Chen- hui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan...
work page 2024
-
[19]
Sequence Transduction with Recurrent Neural Networks
Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012. 3
work page internal anchor Pith review Pith/arXiv arXiv 2012
-
[20]
Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, and et al. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illu- sion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14375–14385, 2024. 3, 6
work page 2024
-
[21]
Visual program- ming: Compositional visual reasoning without training
Tanmay Gupta and Aniruddha Kembhavi. Visual program- ming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023. 3
work page 2023
-
[22]
Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes, 2023. 3
work page 2023
-
[23]
Hanxu Hu, Simon Yu, Pinzhen Chen, and Edoardo M. Ponti. Fine-tuning large language models with sequential instruc- tions, 2024. 3
work page 2024
-
[24]
Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, and Ariel Fuxman. Visual program distillation: Distilling tools and programmatic reasoning into vision-language models,
-
[25]
arXiv preprint arXiv:2210.11610 , year=
Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610,
-
[26]
Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation em- powers large language models with image and video un- derstanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700– 13710, 2024. 1, 3
work page 2024
-
[27]
Lawrence Zitnick, and Ross Girshick
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and el- ementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 3, 4
work page 2017
-
[28]
A diagram is worth a dozen images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. pages 235–251, 2016. 3, 4, 5, 6
work page 2016
-
[29]
Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil ˙e Lukoˇsi¯ut˙e, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shan- non Yang, Thomas Henighan, Timothy...
work page 2023
-
[30]
Weakly-supervised 3d spatial reasoning for text-based visual question answering
Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, and Jie Chen. Weakly-supervised 3d spatial reasoning for text-based visual question answering. IEEE Transactions on Image Processing, 32:3367–3382, 2023. 3
work page 2023
-
[31]
Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S. Kankanhalli. People in social context (pisc) dataset, 2017. Data set. 4
work page 2017
-
[32]
Tokenpacker: Effi- cient visual projector for multimodal llm
Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jianke Zhu, and Lei Zhang. Tokenpacker: Effi- cient visual projector for multimodal llm. arXiv preprint arXiv:2407.02392, 2024. 3
-
[33]
Vila: On pre-training for visual language models
Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Moham- mad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26689–26699, 2024. 1, 8
work page 2024
-
[34]
Deductive verification of chain-of-thought reasoning, 2023
Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, and Hao Su. Deductive verification of chain-of-thought reasoning, 2023. 1
work page 2023
-
[35]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2023. 1, 3
work page 2023
-
[36]
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? arXiv:2307.06281,
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022. 4, 5
work page 2022
-
[38]
Mathvista: Evaluating mathe- matical reasoning of foundation models in visual contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathe- matical reasoning of foundation models in visual contexts. In International Conference on Learning Representations (ICLR), 2024. 3, 6
work page 2024
-
[39]
Ovis: Structural em- bedding alignment for multimodal large language model
Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural em- bedding alignment for multimodal large language model. arXiv:2405.20797, 2024. 8
-
[40]
A review of emerging research directions in abstract visual reasoning
Mikołaj Małki ´nski and Jacek Ma ´ndziuk. A review of emerging research directions in abstract visual reasoning. Information Fusion, 91:713–736, 2023. 3
work page 2023
-
[41]
ChartQA: A benchmark for question an- swering about charts with visual and logical reasoning
Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question an- swering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, 2022. Asso- ciation for Computational Linguistics. 4
work page 2022
-
[42]
Minesh Mathew, Dimosthenis Karatzas, and C. V . Jawa- har. Docvqa: A dataset for vqa on document images. In 2021 IEEE Winter Conference on Applications of Computer Vision (W ACV), pages 2199–2208, 2021. 4
work page 2021
-
[43]
Llama 3.2: Revolutionizing edge ai and vision with open, customizable models
Meta AI. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. https://ai.meta. com/blog/llama-3-2-connect-2024-vision- edge-mobile-devices/, 2024. 1, 5, 8, 3
work page 2024
-
[44]
Gpt-4o mini: advancing cost-efficient intelligence
OpenAI. Gpt-4o mini: advancing cost-efficient intelligence. https : / / openai . com / index / gpt - 4o - mini - advancing - cost - efficient - intelligence/,
-
[45]
Prism: A framework for decoupling and assessing the capabilities of vlms, 2024
Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Prism: A framework for decoupling and assessing the capabilities of vlms, 2024. 3, 8
work page 2024
-
[46]
Gemini 1.5: Unlocking multimodal un- derstanding across millions of tokens of context, 2024
Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrit- twieser, and et al. Gemini 1.5: Unlocking multimodal un- derstanding across millions of tokens of context, 2024. 8
work page 2024
-
[47]
Commonsense reasoning for natural language processing
Maarten Sap, Vered Shwartz, Antoine Bosselut, Yejin Choi, and Dan Roth. Commonsense reasoning for natural language processing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 27–33, 2020. 3
work page 2020
-
[48]
A-okvqa: A benchmark for visual question answering using world knowl- edge, 2022
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowl- edge, 2022. 4
work page 2022
-
[49]
Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning,
-
[50]
Rethinking data selection for supervised fine- tuning
Ming Shen. Rethinking data selection for supervised fine- tuning. arXiv preprint arXiv:2402.06094, 2024. 3
-
[51]
Scaling llm test-time compute optimally can be more effec- tive than scaling model parameters, 2024
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effec- tive than scaling model parameters, 2024. 1, 3
work page 2024
-
[52]
Sequence to sequence learning with neural networks
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2014. 3
work page 2014
-
[53]
Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompt- ing, 2023. 1
work page 2023
-
[54]
Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Jun- yang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024. 8
work page 2024
-
[55]
Improved value alignment in large language models using variational best- of-n techniques, 2024
Xiaofei Wang, Jinhua Li, and Yifan Zhang. Improved value alignment in large language models using variational best- of-n techniques, 2024. 3
work page 2024
-
[56]
Chain-of-thought prompting elicits reasoning in large lan- guage models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large lan- guage models. Advances in neural information processing systems, 35:24824–24837, 2022. 1, 3
work page 2022
-
[57]
Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding, 2024
Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. Deepseek-vl2: Mixture-of-experts visio...
work page 2024
-
[58]
Teilp: Time prediction over knowledge graphs via logical reasoning
Siheng Xiong, Yuan Yang, Ali Payani, James C Kerce, and Faramarz Fekri. Teilp: Time prediction over knowledge graphs via logical reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 16112–16119,
-
[59]
The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023
Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023. 4
work page 2023
-
[60]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[61]
Evagaussians: Event stream as- sisted gaussian splatting from blurry images
Wangbo Yu, Chaoran Feng, Jiye Tang, Xu Jia, Li Yuan, and Yonghong Tian. Evagaussians: Event stream as- sisted gaussian splatting from blurry images. arXiv preprint arXiv:2405.20224, 2024. 3
-
[62]
Mm-vet: Evaluating large multimodal models for inte- grated capabilities
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for inte- grated capabilities. In International conference on machine learning. PMLR, 2024. 3, 6
work page 2024
-
[63]
Why johnny can’t prompt: how non-ai experts try (and fail) to design llm prompts
JD Zamfirescu-Pereira, Richmond Y Wong, Bjoern Hart- mann, and Qian Yang. Why johnny can’t prompt: how non-ai experts try (and fail) to design llm prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–21, 2023. 3
work page 2023
-
[64]
Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, et al. Internlm-xcomposer2. 5-reward: A simple yet effective multi-modal reward model. arXiv preprint arXiv:2501.12368, 2025. 7, 2
-
[65]
From recognition to cognition: Visual commonsense reason- ing, 2019
Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reason- ing, 2019. 1
work page 2019
-
[66]
Improve vision language model chain-of- thought reasoning, 2024
Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of- thought reasoning, 2024. 1
work page 2024
-
[67]
Marco-o1: Towards open reasoning models for open- ended solutions, 2024
Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, and Kaifu Zhang. Marco-o1: Towards open reasoning models for open- ended solutions, 2024. 3
work page 2024
-
[68]
Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yi- fan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, and et al. Evaluation of openai o1: Opportuni- ties and challenges of agi, 2024. 1 LLaVA-CoT: Let Vision Language Models Reason Step-by-Step Supplementary Material A. Illustrative Cases of Reasoning Challenges in VLMs In the main p...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.